Friday, 5 January 2018

What a difference a JVM makes?

JDK 9 is out! But as a library writer, this means change, and change can go either way... Once we've satisfied that JCTools works with JDK9, what other observations can we make? Well, one of the main motivations for using JCTools is performance, and since the code has been predominantly tested and run with JDK8, is it even better with  JDK9? is it worse?

A JVM gets into trouble

I started my comparison (and we will not cover anything else cause the fuckers broke early) with the simplest queue, the SpscArrayQueue:
$ taskset -c 4-7 java -jar jctools-benchmarks/target/microbenchmarks.jar throughput.QueueThroughputBackoffNone -p qType=SpscArrayQueue -p qCapacity=131072 -jvmArgs="-Xmx1g -Xms1g" -i 5 -wi 15 -r 5 -w 1 -f 3

Oracle8u144:
offersFailed |    0.020 ± 0.020  ops/us
pollsFailed  |    0.093 ± 0.102  ops/us
pollsMade    |  361.161 ± 4.126  ops/us

Oracle9.0.1:
offersFailed |    0.065 ± 0.269  ops/us
pollsFailed  |    5.987 ± 2.788  ops/us
pollsMade    |   26.182 ± 2.273  ops/us

Some explanations on method and results:
  • This is running on my beefy laptop, lots of memory to spare, Xeon(R) CPU E3-1505M v6 @ 3.00GHz. I set the gov'nor to "userspace" and frequency to 2.8GHz to avoid CPU frequency scaling and turbo boosting while benchmarking. I use taskset above to pin the JVM process 1 logical core on each physical core, so no 2 threads share a core. Easier than disabling HT, and sufficient for this exploration.
  • The QueueThroughputBackoffNone benchmark is an all out throughput benchmark for queues where producers and consumers spin to offer/poll (with j.u.Queue semantics). Failures are recorded and the throughput observed is the rate of successful polls per microsecond(so millions per second if you prefer that figure). The benchmark is run with a single producer and single consumer thread, and is known to be sensitive to producer/consumer speed balance as contending on empty/full queue can lead to degraded performance. See some discussion of the benchmark here.
Back to the results. They is not good :(
Why would this happen? HOW COULD THIS HAPPEN!!!
I profiled this miserable little bastard only to find that on the offer side of there's a previously unobserved bottleneck:

What's happening? The whole point of SPSC is that there's no need for a strong memory barrier, no need for lock add or CAS, just some careful ordering (using putOrdered/lazySet). But here we got this LOCK ADDL (line 34) instruction spoiling all the fun and eating all the cycles (but the next line is getting all the blame, typical).
Where did it come from? I didn't put it there. Note that this is add 0 to the base of the stack(-0x40), which is how a StoreLoad barrier is implemented (see interesting post on barrier implementation details). But there's no volatile store in sight.

A JVM gets out of trouble

TBH, even before bothering to look at the assembly I reached for a big "usual suspect" for performance differences with JDK9: G1GC is the new default GC!
I quickly verified this theory (that G1GC fucked this up for me) by re-running the same benchmark on JDK9 with the JDK8 default GC (-XX:+UseParallelGC). Got similar results to JDK8. Awesome-ish:
Oracle9.0.1 (-XX:+UseParallelGC)
offersFailed |    0.059 ±  0.133  ops/us
pollsFailed  |    0.147 ±  0.251  ops/us
pollsMade    |  356.100 ±  6.305  ops/us

The bottleneck I've hit? it's G1GC card marking! I even wrote a post about it. G1GC write barrier is quite different from CMS/Parallel. You'll notice that the barrier discussed there is a little different from the one we see here. Times, they are a changing...
So... the G1GC write barrier is ANGRY. Why so angry?

What this means is that the g1_write_barrier_post is pissed because:
  • buffer (the backing array for the queue) and element (being offered) are from different regions
  • element is not null
  • card (for the buffer) is not young
Confusingly, when playing around with this issue I moved from a small machine (8gb) to a bigger one (32gb) and when running with a larger heap size specified the issue became allot less pronounced. This is because, if we set the size of the heap to 1g we get 1mb regions. If we set the size of the heap to 8g (set both mx and ms) however we get 4mb regions. We can demonstrate this is the issue by running again with G1 and setting the region size:
Oracle9.0.1 (-XX:+UseG1GC -Xms1g -Xmx1g -XX:G1HeapRegionSize=4m) offersFailed | 0.009 ± 0.033 ops/us pollsFailed | 0.252 ± 0.257 ops/us pollsMade | 183.827 ± 16.650 ops/us

So, still not so brilliant, but much better. This however implies that the improvement is due to the buffer and the element being allocated from the same region. This is purely a product of the benchmark and the chosen queue size, and is not typical of normal applications. It follows therefore (I think) that the behaviour we see here is quite likely to manifest pathologically for writes into long standing data structures, such as caches and queues. And indeed if we increase the queue capacity the situation reverts back.
All this messing around with JDK9 vs 8, runtime and GC impact comparison got me thinking of a couple of other regional collectors which might exhibit interesting behaviours here. Namely, Zing C4 and Shenandoah.

Put some Zing in that thing

I no longer work at Azul, so I had to get an Open Source developer licence, which was a breeze. I had to get some help with getting it all to work on Ubuntu 17.10, but 16.04 works off the bat (just follow the instructions). Starting with same parameters I got:
Zing8u17.12:
offersFailed |    0.016 ± 0.010  ops/us
pollsFailed  |    0.013 ± 0.022  ops/us
pollsMade    |  302.288 ± 3.602  ops/us

So Zing is 20% slower than Oracle ParallelGC here, but significantly better than G1 (either case). The JMH perfasm profiler does not work with Zing, though Azul does have a matching tool if you ask their support. To look at the profile I can either use ZVision, or Oracle Studio. I went with the latter, just because.
The profile is hard to read, so I might go into it another time, but the question that seems obvious is: "Is Zing slower than Oracle+ParallelGC because of read/write barrier costs?"
Zing after all is not a GC add-on to OpenJDK, but a completely different runtime+GC+compiler. In particular, Zing has recently switched from their old C2 like compiler (they forked paths many moons ago, but share a parent in Cliff Click), to an LLVM compiler which is now the default called Falcon. Testing that quickly by forcing Zing to use the C2 compiler yields the following results:
Zing8u17.12 (-XX:+UseC2):
offersFailed |    0.034 ±  0.055  ops/us
pollsFailed  |    0.010 ±  0.017  ops/us
pollsMade    |  198.067 ± 31.983  ops/us

OK, so Falcon is a big win for Zing here, that's not the issue. Can we take the read and write barriers out of the picture?
Sure we can! Zing supports 2 exciting GCs, you may have heard of the C4 GC, but Zing also supports the NoGC (it's very Zen), which is exactly what it sounds like. Running with no GC however may remove some positive effects GC has (e.g. not crashing when you've allocated more than your heap size and never collected, but also compacted relocated data), so we need to run NoGC with barriers, and NoGC with no barriers:
Zing8u17.12 (-XX:+GPGCNoGC -XX:+UseSVBs -XX:+UseLVBs):
offersFailed |    0.035 ± 0.053  ops/us
pollsFailed  |    0.022 ± 0.043  ops/us
pollsMade    |  302.433 ± 2.675  ops/us

So, turning GC off makes no difference at all for this benchmark. That's great, as we can consider the removal of the barriers in isolation:

Zing8u17.12 (-XX:+GPGCNoGC -XX:-UseSVBs -XX:-UseLVBs):
offersFailed |    0.099 ±  0.070  ops/us
pollsFailed  |    0.027 ±  0.048  ops/us
pollsMade    |  314.498 ± 17.872  ops/us

Note that we see:
  1. Some improvement when barriers are removed.
  2. Increased variance in results. This was due to large run to run variance which requires further digging.
So, while we can certainly see a difference here which is due to read/write barriers, that is not the whole story(maybe another day). My gut feeling is that Falcon is over inlining in this instance.

Oh Shenandoah

For Shenandoah I grabbed one of the builds provided by the benevolent Shipilev here. The build I got is this one: build 1.8.0-internal-jenkins_2017_11_12_03_35-b00
Running with same heap size, but remembering that these builds don't default to Shenandoah:
Shenandoah (-XX:+UseShenandoahGC):
offersFailed |    0.031 ± 0.024  ops/us
pollsFailed  |    0.009 ± 0.025  ops/us
pollsMade    |  143.165 ± 3.172  ops/us

Note that for Shenandoah there's not a massive imbalance between the offer/poll side, which indicates the issue is not pathological to one method or the other. For G1GC the problem was very clearly on the offer side. I had a peek at the assembly, but it's going to take me some time to get to grips with what's going on as it's a completely new set of tricks and quirks to look at. To get a high level view though, it's interesting to compare the HW counters for ParallelGC/Zing/G1/Shenandoah:

                      | PGC     |  Zing    | G1      | Shenandoah
pollsMade             | 368.055 |  303.039 | 194.538 | 140.236
CPI                   |   0.244 |    0.222 |   0.352 |   0.355 
cycles                |   7.629 |    9.319 |  14.764 |  19.998 
instructions          |  31.218 |   41.972 |  42.002 |  56.294 
branches              |   6.052 |   11.672 |   7.735 |   9.017 
L1-dcache-loads       |  10.057 |   10.936 |  14.320 |  26.707 
L1-dcache-load-misses |   0.067 |    0.149 |   0.162 |   0.100 
L1-dcache-stores      |   3.553 |    3.042 |   4.057 |   5.064 

  • This is a tiny workload, with 30-50 instructions per operation. I want to make it clear that it is very easy to have large differences between JVMs in such specific scenarios. This workload is all about loading and storing references in/out of an array. The data is all L1 resident, this is NOT A REPRESENTATIVE COMPARISON OF PERFORMANCE. If you want to know how these JVMs/GCs can help your application, run a representative workload with your application.
  • Seriously, let's not start a "my JVM/GC is better than thou" war here, OK people?
  • For simplicity of comparison I've run PGC/G1/Shenandoah out of the Shenandoah build that includes all 3. This makes for a simpler comparison as they all share the same compiler, but is not comparing with the relevant Oracle build (though it should be pretty much the same).
  • Zing has better CPI than PGC, but 10 more instructions per operation. These include 1 extra load, and 6 more branches. There are no branch misses in this workload, so the branches are just extra pressure on the branch predictor and more instructions. The 10 instructions difference translates into a 1.6 cycle difference, this is indicative of the success of the branch predictor in reducing the impact of the branches. These extra branches are the Zing LVB or read barrier. Each reference load costs at extra branch. Zing is doing less stores here, this is due to it's defensive approach to card marking.
  • G1 is given here with the good case (same region), as we already covered the bad case. We see G1 increasing the loads by 4 but using less branches than Zing, only 2 extra branches. These are related to the write barrier.  We also see an extra store.
  • Shenandoah is using 3 extra instructions, and 16 extra loads. This is more than I expected. Since Shenandoah is using a Brooks-Pointer you would expect an extra load for each reference load. If we estimate from the Zing branch increase that we have 6 reference loads, I'd expect to have 6 extra loads on the Shenandoah side. I assume the other loads are related to card marking but I will need to learn more about this collector to say anything. Shipilev has expanded on my rudimentary analysis here. His conclusion: "read and write barriers around Unsafe intrinsics are very active. C2 handling on Unsafe intrinsics uses CPUOrder membars a lot (http://hg.openjdk.java.net/shenandoah/jdk10/file/1819ee64325f/src/hotspot/share/opto/library_call.cpp#l2631),which may inhibit some barrier optimizations. The workload is also tied up in a very unlucky volatile-predicated loop that prevents barrier hoisting.
    Pending codegeneration improvements alleviate barrier costs even when they are not optimized."
Since I fully expect all the crusaders to get hot and bothered about the above I thought I'd throw in...


A Holy Graal!

Graal is the next gen compiler for HotSpot. Coming out of Oracle Labs and already running in Twitter production, I thought we should add another compiler dimension to this mix. To run Graal you can use: "-XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler"
Comes included in the Java 9 package!!! So how does it do?

Oracle9.0.1 (-XX:+UseParallelGC -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler)

offersFailed |      ≈ 0          ops/us
pollsFailed  |    0.347 ± 0.441  ops/us
pollsMade    |   51.657 ± 3.568  ops/us

WHHHHHHHHYYYYYYYYYYYY!!!!!!!!
Why can't we have nice things? Well... It turns out the good folks at Oracle Labs have not yet implemented putOrdered as nicely as C2 and Falcon has, and have thus competely buggered up my SPSC queue :(


Summary: There Are Many JVMs In My Father's House

Variety is the spice of life as they say. In the narrow narrow usecase presented above we saw different JVMs/GCs/JITs throwing up all different behaviours. Some good, some less so. For this workload there's very little happening, no GC, no exciting vectorization opportunities, it's very limited. But, being limited has the benefit of simplicity and the opportunity to contrast.
Also note that by profiling and refining this code on C2 I have perhaps overfitted it to one compiler at the expense of others, I certainly had more opportunity to eliminate any issues for the C2+ParallelGC scenario....

I hope you enjoyed to tour, I encourage you to take these new friends home and play with them ;-).

12 comments:

  1. Shenandoah is also able to run without barriers, and that allows dissecting where the performance hit is coming from. It looks like both read and write barriers are responsible for the performance hit.

    The workload itself is heavy Unsafe user, which means it uses lots of C2 intrinsics, and there lays an interesting caveat: C2 likes to put CPUOrder barriers around those accesses, and that might bork barrier optimizations, like hoisting barriers from the loops, etc. With Shenandoah's heavier barriers, the hit might be bigger than for other collectors.

    See more here:
    http://cr.openjdk.java.net/~shade/shenandoah/jctools-QueueThroughputBackoffNone.txt

    ReplyDelete
    Replies
    1. Thanks! I'd be eager to read your annotation on the assembly for Shenandoah for this benchmark. Also, can you please note the flags for passive mode and enable/disable flags?
      The Unsafe usage is indeed a potential problem. I have a VarHandle (not committed yet) and AtomicFieldUpdater (run with qType=SpscAtomicArrayQueue) versions for this queue which I can test with, would that help eliminate that concern? running with the AtomicUpdater version shows no improvement.

      Delete
    2. Added the flags to the reference file above. I don't think VarHandle or A*FU would help, because they end up in the same C2 intrinsics.

      Delete
    3. Thanks! updated with your conclusion and link to analysis. Thanks for the pointers on the GC flags and read/write barriers assembly, it helps. I wish the compiler would annotate these things better. I note in your runs that PGC is still better than Shenandoah passive. I thought it should be as good or better?

      Delete
    4. PGC vs Shenandoah was off-the-wall run, notice the large error margins, that does not allow to make conclusive claim if either is better. Passive mode is supposed to be very close to PGC, sometimes faster, sometimes slower.

      I've found a few more codegen things that might be improved: for instance, SATB barrier that Shenandoah borrows from G1 seems to introduce more spills (and not XMM spills at that, because the leaf call to VM may modify them?). We would dig deeper, and this would improve.

      I wish the tests were in ns/op, to get the feel now tiny the workload really is, and why small differences in codegen add up to large performance hits.

      Delete
    5. Fair enough, I leave it to you to do a more rigorous comparison.

      Happy to hear the benchmark exposed some opportunities for here, would make a nice blog post to do a before/after and work through the tweaks. Is there a bug link for tracking this?

      The ops are indeed tiny. It is however a throughput test, and the aux counters make no sense at all in the avgt mode... There's a timing/cost benchmark you can look at if you prefer: QueueBurstCost

      Hope that helps.

      Delete
  2. Thanks for the post.
    Typo: "interesting to compare the HW counters for ParallelGC/ZingG1/Shenandoah". ZingG1?

    ReplyDelete
  3. Very nice read Nitsan.

    Is the performance equally bad when using Java 8 + G1 the same as Java 9 + G1? Or is there a regression?

    ReplyDelete
    Replies
    1. yes, the GC default change is the issue. JDK8 + G1 is same.

      Delete
  4. Should be easy to submit a PR with a fix for the Graal compiler.

    ReplyDelete
  5. Thanks for the detailed analysis.
    Graal recently updated some intrinsics for put/getOrdered methods.
    The fixes will land on RC3, in the meantime here's how to test/build Graal from source:
    https://gist.github.com/mukel/bc21a0acfe8c924fcc5fec1f480166cc


    ReplyDelete

Note: only a member of this blog may post a comment.