Wednesday 11 July 2018

How Inlined Code Makes For Confusing Profiles

Inlining is a powerful and common optimization technique used by compilers.
But inlining combines with other optimizations to transform your code to such an extent that other tooling becomes confused, namely profilers.

Mommy, What is Inlining?

Inlining is the mother of all optimizations (to quote C. Click), and mothers are awesome as we all know. Inlining is simply the practice of replacing a method call with the method body. This doesn't sound like much of an improvement, but it's potentially a massive win for a couple of reasons:
  • Expanding the scope of other optimizations! Calling a method with a constant parameter may, for instance, end up folded into a constant (see the sketch after this list). Most compilers take the view that you either know everything about a method, because it's the method you are compiling ATM, or you know next to nothing about it. Calling another method may end up blocking, or having some unknown side effect, or bringing the world to an end. Inlining expands the horizons of all available optimizations into the inlined code. These benefits can be quite dramatic, eliminating much of the inlined code (callee) on the one hand and supporting assumptions about the compilation unit at hand (caller) on the other.
  • For very small methods the call overhead may actually exceed the cost of the method body. Importantly, and perhaps counter-intuitively, the cost of calling a method is not fixed. It will depend, for example, on the calling convention (whose job it is to keep track of which registers) and on pre/post conditions (e.g. ordering and safepoints).
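To make the first point concrete, here's a minimal, hypothetical sketch (the names are mine, not from any particular codebase):

class InlineFoldSketch {
    static int square(int x) {
        return x * x;
    }

    static int area() {
        // Once square() is inlined here, this becomes "return 4 * 4;",
        // which constant-folds to "return 16;" - the call disappears entirely.
        return square(4);
    }
}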
But inlining is not without its potential issues:
  • "Code explosion": if I have a method of size 128b and that method cannot be reduced in any way by the context of its caller, every time I inline it I increase my code size by 128b (a bit less, as I can drop the call-related prologue/epilogue). This may lead to a very bloated application, which puts pressure on my instruction cache as well as the JVM code cache.
  • Increased pressure on the compiler itself: being challenged with larger and larger compilation units leads to longer compile times, increased CPU usage, delayed compilation etc.
  • Increasingly complex compilation units may result in less optimal code than simpler ones. This is a real-world limitation, though I would think that in theory, the more context the compiler has to work with (and assuming an infinite amount of CPU/memory/monkeys to throw at the problem), the better the overall result should be. In practice I have seen several cases where limiting inlining can improve results.
Picking the right boundaries is challenging, but luckily we got them bright compiler engineers to worry about that.

How Much Inlining Can We Do?

The amount of inlining a compiler does will depend on many factors. I'm not going to get very far if I try and explain the heuristics involved in great detail but some basic dimensions to the problem are:
  • Who is the callee? In Java almost all method calls are 'virtual'. This means that at runtime any number of implementations could possibly exist for a particular call. This is traditionally where you'd give up on inlining. To make inlining happen anyway the compiler must convert a virtual call into a static call before it can be inlined. The same code in a different runtime may induce different compiler decisions on this. JIT compilation allows the compiler to tailor the decision in a way that AOT compilation cannot (in a dynamic environment). 
  • How big is the callee? Most compilers put some upper limit on how big a caller method can grow, as well as on how big a callee method can be and still get inlined. A very large caller will be blocked from further inlining, and even a medium sized callee might be blocked from being inlined. What is considered big will depend on the compiler, the configuration and the profile. Limitations on compilation time for JIT compilers also drive this decision, at least partially.
  • How deep is the already inlined stack? How many veils can the compiler pierce? What about recursion?
So, that's the general shape of the problem: Virtual x Size x Depth. And if you were gonna try and write an inlining strategy for a compiler you'd need a lot more detail, but we can keep going with this for now.
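For HotSpot specifically, the size and depth knobs surface as command line flags. A rough map of the heuristic (values are the JDK 8 defaults as best I recall them, double check your own JVM with -XX:+PrintFlagsFinal):
  • -XX:MaxInlineSize=35 - bytecode size limit for inlining a callee that is not considered hot
  • -XX:FreqInlineSize=325 - bytecode size limit for inlining a hot callee
  • -XX:InlineSmallCode - don't inline a callee whose already-compiled code is larger than this many bytes
  • -XX:MaxInlineLevel=9 - how deep inlining may nest
  • -XX:MaxRecursiveInlineLevel=1 - how deep recursive inlining may go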

How Much Inlining Does Java Actually Do?

And by Java I mean the Oracle JDK 8u161. These things change whimsically from run to run, release to release, JVM to JVM etc., so stick a big YMMV on all of this.
We can have the JVM helpfully share compilation details with us via a bunch of flags, but most importantly for this feature we will need: -XX:+PrintInlining -XX:+PrintCompilation (the output is a bit hard to reason about because of concurrent logging...).
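For example, to get this output for the benchmark discussed below you could run something along these lines (my own sketch of the invocation, not necessarily the exact command used; note that PrintInlining is a diagnostic flag and needs unlocking):

$ java -jar jctools-benchmarks/target/microbenchmarks.jar throughput.QueueThroughputBackoffNone -p qType=SpscArrayQueue -f 1 -jvmArgs="-XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining"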
For the sake of demonstration, here's the output of the offer method of SpscArrayQueue from a throughput benchmark I run regularly for JCTools:

Some observations:
  • Tiny methods and some intrinsics are inlined even by the early compilation.
  • The offerSlowPath method which is not inlined by the first compilation is inlined by the C2 compiler at a second pass.
This is not a particularly challenging piece of code, but I picked it because it is easy enough for a human to follow. Small methods get inlined, even if they are nested a bit in other tiny methods.
There's a performance-related nugget to dig into here:
  • I split the offerSlowPath method out with the intent of not having it inlined (or at least leaving this decision to the compiler). The compiler decided it is better off inlined. Who's right? Has the compiler defeated my puny human brain? Am I as worthless as my mother-in-law suspects? These questions will be answered some other day.
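For context, the shape of that split is roughly the following. This is a simplified, single-threaded sketch of the "split the cold path out" idiom, not the actual JCTools source; the real SpscArrayQueue works with Unsafe and ordered stores, which are elided here:

class SketchQueue<E> {
    final E[] buffer;
    final long mask;
    long producerIndex;
    long producerLimit;          // cached view of how far we may write
    volatile long consumerIndex;

    @SuppressWarnings("unchecked")
    SketchQueue(int capacity) {
        buffer = (E[]) new Object[capacity];
        mask = capacity - 1;
    }

    public boolean offer(E e) {
        final long pIndex = producerIndex;
        // fast path: we already know there is room up to the cached limit
        if (pIndex >= producerLimit && !offerSlowPath(pIndex)) {
            return false;
        }
        buffer[(int) (pIndex & mask)] = e; // the real queue uses an ordered store here
        producerIndex = pIndex + 1;        // and an ordered store here too
        return true;
    }

    // the cold path: refresh the cached limit from the consumer index
    private boolean offerSlowPath(long pIndex) {
        final long cIndex = consumerIndex;
        if (pIndex - cIndex == buffer.length) {
            return false; // genuinely full
        }
        producerLimit = cIndex + buffer.length;
        return true;
    }
}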

Where Do Profilers Come In?

In traditional Java profilers no attempt is made at distinguishing between inlined frames (calls to methods which have been inlined) and 'real' frames (real method calls on the stack). Telling the two apart can be useful:
  • Inlining is often a good thing, so failure to inline a method can be worth looking into. It's also instructive to see just how much the compiler can inline, and how many layers of distraction it must cut through to get to where the actual code you want to run is. Excessive "Manager/Abstract/Adapter/Proxy/Wrapper/Impl/Dimple/Pimpl" layering, however, can still prevent inlining and may also lead to erectile dysfunction, early onset dementia and being knifed by your colleagues.
  • Hot methods at the top of the stack which are not inlined are obvious curiosities. Perhaps the callers are too large, or the method itself is?
  • Multiple implementors of the same method which are not inlined indicate a megamorphic callsite. If one of these is significantly hotter than the rest you might be able to special-case it (peel off the callsite with an instanceof check, effectively rolling your own conditional inlining; see the sketch after this list).
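The peeling trick mentioned in the last bullet looks roughly like this (hypothetical Shape/Circle types, purely for illustration):

class PeelSketch {
    interface Shape { double area(); }

    static final class Circle implements Shape {
        final double r;
        Circle(double r) { this.r = r; }
        public double area() { return Math.PI * r * r; }
    }

    // shape.area() would be megamorphic if many Shape implementations are hot.
    // If Circle is known to dominate, peel it off with an instanceof check:
    static double area(Shape shape) {
        if (shape instanceof Circle) {
            return ((Circle) shape).area(); // monomorphic: can be devirtualized and inlined
        }
        return shape.area();                // cold, still-megamorphic fallback
    }
}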
But missing out on some good opportunities is not the issue I'm concerned with in this post. The point I'd like to get across is that inlining makes an already confusing reality much, much more confusing.

A Confusing Reality

CPUs execute instructions; these instructions are not our code but the product of a series of compilation and optimization steps. It follows that if we hope to learn where the bottleneck in our code/application is, we must be able to translate back from CPU instructions (the product of C1/C2/other JIT compilers) to bytecode (the product of javac/a JVM language compiler/a bytecode generating agent/other) to code coordinates (file + line, or class + method).
This is not a straight forward mapping because:
  • Some instructions do not map to any bytecode: this can happen when the JVM generates code for its own accounting purposes.
  • Some instructions map to many bytecodes: the compiler can be quite clever in substituting many operations with one specialized instruction, or the result of one computation can be reused elsewhere.
  • Some bytecodes do not map to code: this can happen due to bytecode generation.
  • Sometimes the compiler just fails in its bookkeeping :-(.
A modern JIT compiler employs a significant number of optimizations on any given method, allowing it to deliver the speed we love, but making reverse engineering the instructions into code a challenge. The mapping rules employed by the profiling tools trying to bridge this gap, and by the compilers trying to give them a usable index, boil down to:
  • We assume instruction X translates to at most 1 bytecode location.
  • When a mapping does not exist, find the nearest instruction Y which has a mapping and use that.
This is already confusing: the imperfect mapping, coupled with the compiler's license to reorder code, means that the "nearest instruction" may well not be the nearest line of code.
The problem is made worse by the inaccuracy of the profiler observations themselves. It is a known issue, largely unsolved, that the reported instruction from which the mapping starts is often a few (in theory many) instructions away from the intended sample location. This can have a knock-on effect on the ultimately blamed/mapped code. This issue was touched on in a previous post on AGCT profilers.
When looking at a single method this is often a pesky, but ultimately manageable, problem. Sure, the profiler pointed to line A, but any fool can see the problem is line B... silly little hobbit.

MOAR, BIGGER, CONFUSION

So, given that:
  • The instruction reported is a few instructions off.
  • The instructions do not map perfectly to lines of code.
  • The order of instructions does not have to match the order of the code.
We can see how even a small inaccuracy can translate into a large distance in lines of code between the cost we are trying to profile and where we end up. This is made exponentially worse by inlining.
Where we had one method, we now have many methods, from many classes. As the code gets pulled in it is subjected to the same compiler "remixing", but with an expanded horizon. For instance, a computation repeated in the caller and callee can now be done once. Considering that inlining 4-6 layers of indirection is very common, the "LOC distance" between "near instructions" can end up being very confusing.
I should really elaborate with a clear example, but I do not have a repeatable experiment to hand and have already stalled on this article for a few months while distracted elsewhere. An interesting example of how this can work out is available here. In that issue we have the wrong instruction being blamed for the cost of a cache miss in a loop. The cache miss originates in the bowels of an inlined LinkedHashMap::get. The nearest instruction with a mapping, however, is from Field::binaryValue (line 441); it's easy to see how this would end up a very misleading profile if considered from a code perspective. When observed at the assembly level the instruction slip is much easier (as much as reading assembly is easy) to see through, but profiling at the assembly level is something very few people do and very few tools support.

Summary

Inlining is good, and profiling is good. But profiling modern compiled, inlined code presents challenges which are not easily solvable by looking at the common output of most tools. The more sense the compiler makes of our code, its context, its execution tree, the less obvious the mapping between instructions and code becomes.
So, what can we do? I think we should have our tools approach the problem in 2 ways:
  1. Fuzzier profiles: What would happen if, instead of assigning cost to a single instruction (and its code coordinates), we distributed the cost over a range of instructions? We could have more or less sophisticated models, but even a naive bell curve around the sampled instruction would bring more context into the profile, and may help. This is hypothetical, as I am not aware of any tools which attempt this ATM (a sketch of the idea follows this list).
  2. Differentiate between real/inlined frames: This is something the JVM is certainly capable of doing, but most profiling APIs do not support it. AFAIK only Oracle Studio and perf/bcc + perf-map-agent support views on real vs. inlined frames. I tend to use the latter more than the former. There's talk of adding support for this into async-profiler.
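To make the first suggestion concrete, here's a naive sketch of what fuzzy attribution might look like (my own illustration, not an existing tool):

class FuzzyProfileSketch {
    // Instead of charging the whole sample to one instruction index, spread it
    // over a window with Gaussian weights centred on the sampled instruction.
    static void recordFuzzySample(double[] costPerInstruction, int sampledIndex, double sigma) {
        int window = (int) Math.ceil(3 * sigma);
        double[] weights = new double[2 * window + 1];
        double totalWeight = 0;
        for (int i = -window; i <= window; i++) {
            weights[i + window] = Math.exp(-(i * i) / (2 * sigma * sigma));
            totalWeight += weights[i + window];
        }
        for (int i = -window; i <= window; i++) {
            int index = sampledIndex + i;
            if (index < 0 || index >= costPerInstruction.length) continue;
            // each sample still contributes a total weight of ~1 to the profile
            costPerInstruction[index] += weights[i + window] / totalWeight;
        }
    }
}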

Many thanks to V. Sitnikov for reviewing :-).

This post is one of a few on the topic of profiling, check the others out.

Friday 5 January 2018

What a difference a JVM makes?

JDK 9 is out! But as a library writer, this means change, and change can go either way... Once we've satisfied ourselves that JCTools works with JDK 9, what other observations can we make? Well, one of the main motivations for using JCTools is performance, and since the code has been predominantly tested and run with JDK 8, is it even better with JDK 9? Is it worse?

A JVM gets into trouble

I started my comparison (and we will not cover anything else because things broke early on) with the simplest queue, the SpscArrayQueue:
$ taskset -c 4-7 java -jar jctools-benchmarks/target/microbenchmarks.jar throughput.QueueThroughputBackoffNone -p qType=SpscArrayQueue -p qCapacity=131072 -jvmArgs="-Xmx1g -Xms1g" -i 5 -wi 15 -r 5 -w 1 -f 3

Oracle8u144:
offersFailed |    0.020 ± 0.020  ops/us
pollsFailed  |    0.093 ± 0.102  ops/us
pollsMade    |  361.161 ± 4.126  ops/us

Oracle9.0.1:
offersFailed |    0.065 ± 0.269  ops/us
pollsFailed  |    5.987 ± 2.788  ops/us
pollsMade    |   26.182 ± 2.273  ops/us

Some explanations of the method and results:
  • This is running on my beefy laptop, lots of memory to spare, Xeon(R) CPU E3-1505M v6 @ 3.00GHz. I set the gov'nor to "userspace" and the frequency to 2.8GHz to avoid CPU frequency scaling and turbo boosting while benchmarking. I use taskset above to pin the JVM process to one logical core on each physical core, so no 2 threads share a core. Easier than disabling HT, and sufficient for this exploration.
  • The QueueThroughputBackoffNone benchmark is an all-out throughput benchmark for queues where producers and consumers spin to offer/poll (with j.u.Queue semantics). Failures are recorded and the throughput observed is the rate of successful polls per microsecond (so millions of operations per second if you prefer that figure). The benchmark is run with a single producer and a single consumer thread, and is known to be sensitive to producer/consumer speed balance, as contending on an empty/full queue can lead to degraded performance. See some discussion of the benchmark here, and a rough sketch of its shape below.
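For readers unfamiliar with it, the benchmark's shape is roughly the following. This is a stripped down sketch of the idea, not the actual JCTools benchmark code:

import java.util.Queue;
import org.jctools.queues.SpscArrayQueue;
import org.openjdk.jmh.annotations.*;

@State(Scope.Group)
public class QueueThroughputSketch {
    static final Integer ONE = 1;
    Queue<Integer> q = new SpscArrayQueue<>(131072);

    @AuxCounters
    @State(Scope.Thread)
    public static class OfferCounters {
        public long offersFailed;
        public long offersMade;
    }

    @AuxCounters
    @State(Scope.Thread)
    public static class PollCounters {
        public long pollsFailed;
        public long pollsMade;
    }

    @Benchmark @Group("tpt") @GroupThreads(1)
    public void offer(OfferCounters c) {
        if (!q.offer(ONE)) c.offersFailed++;
        else c.offersMade++;
    }

    @Benchmark @Group("tpt") @GroupThreads(1)
    public void poll(PollCounters c) {
        if (q.poll() == null) c.pollsFailed++;
        else c.pollsMade++;
    }
}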
Back to the results. They are not good :(
Why would this happen? HOW COULD THIS HAPPEN!!!
I profiled this miserable little bastard only to find that on the offer side there's a previously unobserved bottleneck:

What's happening? The whole point of SPSC is that there's no need for a strong memory barrier, no need for a lock add or CAS, just some careful ordering (using putOrdered/lazySet). But here we've got this LOCK ADDL (line 34) instruction spoiling all the fun and eating all the cycles (though the next line is getting all the blame, typical).
Where did it come from? I didn't put it there. Note that this is adding 0 to the base of the stack (-0x40), which is how a StoreLoad barrier is implemented (see this interesting post on barrier implementation details). But there's no volatile store in sight.
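For contrast, here's a minimal sketch of the two publication flavours using AtomicLong (the queue itself uses Unsafe.putOrderedLong on a plain field, but the barrier story is the same):

import java.util.concurrent.atomic.AtomicLong;

class PublishSketch {
    final AtomicLong producerIndex = new AtomicLong();

    void publishVolatile(long index) {
        // volatile store: followed by a StoreLoad barrier, which on x86 shows up
        // as the LOCK-prefixed instruction (e.g. lock addl to the stack) seen above
        producerIndex.set(index);
    }

    void publishOrdered(long index) {
        // ordered ("lazy") store: StoreStore ordering only, no LOCK instruction;
        // enough for a single producer publishing its index
        producerIndex.lazySet(index);
    }
}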

A JVM gets out of trouble

TBH, even before bothering to look at the assembly I reached for a big "usual suspect" for performance differences with JDK9: G1GC is the new default GC!
I quickly verified this theory (that G1GC is the culprit) by re-running the same benchmark on JDK 9 with the JDK 8 default GC (-XX:+UseParallelGC). Got similar results to JDK 8. Awesome-ish:
Oracle9.0.1 (-XX:+UseParallelGC)
offersFailed |    0.059 ±  0.133  ops/us
pollsFailed  |    0.147 ±  0.251  ops/us
pollsMade    |  356.100 ±  6.305  ops/us

The bottleneck I've hit? It's G1GC card marking! I even wrote a post about it. The G1GC write barrier is quite different from the CMS/Parallel one. You'll notice that the barrier discussed there is a little different from the one we see here. Times, they are a-changing...
So... the G1GC write barrier is ANGRY. Why so angry?

What this means is that the g1_write_barrier_post is pissed because (a rough sketch of the check follows this list):
  • buffer (the backing array for the queue) and element (being offered) are from different regions
  • element is not null
  • card (for the buffer) is not young
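In pseudo-Java the post barrier check looks roughly like this (a from-memory sketch with hypothetical helpers, not a faithful copy of the HotSpot code):

abstract class G1PostBarrierSketch {
    // Hypothetical placeholders: REGION_SIZE_LOG2, YOUNG_CARD, cardOf(),
    // storeLoadFence() and dirtyAndEnqueue() stand in for the real VM internals.
    static final int REGION_SIZE_LOG2 = 20; // 1mb regions, as in the 1g heap case below
    static final int YOUNG_CARD = 2;

    // Emitted after every reference store, e.g. buffer[index] = element
    void writeBarrierPost(long fieldAddr, long newVal) {
        if (((fieldAddr ^ newVal) >>> REGION_SIZE_LOG2) == 0) return; // same region: done
        if (newVal == 0) return;                                      // stored null: done
        if (cardOf(fieldAddr) == YOUNG_CARD) return;                  // young card: done
        // cross-region store to a not-young card: fence, dirty the card and enqueue it
        // for concurrent refinement - this is the slow path the benchmark keeps hitting
        storeLoadFence();
        dirtyAndEnqueue(fieldAddr);
    }

    abstract int cardOf(long addr);
    abstract void storeLoadFence();
    abstract void dirtyAndEnqueue(long addr);
}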
Confusingly, when playing around with this issue I moved from a small machine (8gb) to a bigger one (32gb), and when running with a larger heap size the issue became a lot less pronounced. This is because, if we set the size of the heap to 1g we get 1mb regions, but if we set the size of the heap to 8g (set both -Xmx and -Xms) we get 4mb regions (by default G1 aims for roughly 2048 regions, so the region size scales with the heap, rounded to a power of 2 between 1mb and 32mb). We can demonstrate this is the issue by running again with G1 and setting the region size:
Oracle9.0.1 (-XX:+UseG1GC -Xms1g -Xmx1g -XX:G1HeapRegionSize=4m)
offersFailed |    0.009 ±  0.033  ops/us
pollsFailed  |    0.252 ±  0.257  ops/us
pollsMade    |  183.827 ± 16.650  ops/us

So, still not so brilliant, but much better. This, however, implies that the improvement is due to the buffer and the element being allocated from the same region. That is purely a product of the benchmark and the chosen queue size, and is not typical of normal applications. It follows therefore (I think) that the behaviour we see here is quite likely to manifest pathologically for writes into long-standing data structures, such as caches and queues. And indeed, if we increase the queue capacity the situation reverts.
All this messing around with JDK 9 vs 8, comparing runtime and GC impact, got me thinking of a couple of other regional collectors which might exhibit interesting behaviours here, namely Zing C4 and Shenandoah.

Put some Zing in that thing

I no longer work at Azul, so I had to get an Open Source developer licence, which was a breeze. I had to get some help with getting it all to work on Ubuntu 17.10, but 16.04 works off the bat (just follow the instructions). Starting with the same parameters I got:
Zing8u17.12:
offersFailed |    0.016 ± 0.010  ops/us
pollsFailed  |    0.013 ± 0.022  ops/us
pollsMade    |  302.288 ± 3.602  ops/us

So Zing is roughly 15% slower than Oracle ParallelGC here, but significantly better than G1 (either case). The JMH perfasm profiler does not work with Zing, though Azul do have a matching tool if you ask their support. To look at the profile I can either use ZVision or Oracle Studio. I went with the latter, just because.
The profile is hard to read, so I might go into it another time, but the question that seems obvious is: "Is Zing slower than Oracle+ParallelGC because of read/write barrier costs?"
Zing, after all, is not a GC add-on to OpenJDK, but a completely different runtime + GC + compiler. In particular, Zing has recently switched from its old C2-like compiler (they forked paths many moons ago, but share a parent in Cliff Click) to an LLVM-based compiler called Falcon, which is now the default. Testing that quickly by forcing Zing to use the C2 compiler yields the following results:
Zing8u17.12 (-XX:+UseC2):
offersFailed |    0.034 ±  0.055  ops/us
pollsFailed  |    0.010 ±  0.017  ops/us
pollsMade    |  198.067 ± 31.983  ops/us

OK, so Falcon is a big win for Zing here, that's not the issue. Can we take the read and write barriers out of the picture?
Sure we can! Zing supports 2 exciting GCs: you may have heard of the C4 GC, but Zing also supports NoGC (it's very Zen), which is exactly what it sounds like. Running with no GC, however, may remove some positive effects GC has (e.g. not crashing when you've allocated more than your heap size and never collected, but also compacting/relocating data), so we need to run NoGC with barriers, and NoGC with no barriers:
Zing8u17.12 (-XX:+GPGCNoGC -XX:+UseSVBs -XX:+UseLVBs):
offersFailed |    0.035 ± 0.053  ops/us
pollsFailed  |    0.022 ± 0.043  ops/us
pollsMade    |  302.433 ± 2.675  ops/us

So, turning GC off makes no difference at all for this benchmark. That's great, as we can consider the removal of the barriers in isolation:

Zing8u17.12 (-XX:+GPGCNoGC -XX:-UseSVBs -XX:-UseLVBs):
offersFailed |    0.099 ±  0.070  ops/us
pollsFailed  |    0.027 ±  0.048  ops/us
pollsMade    |  314.498 ± 17.872  ops/us

Note that we see:
  1. Some improvement when barriers are removed.
  2. Increased variance in results. This was due to large run-to-run variance which requires further digging.
So, while we can certainly see a difference here which is due to read/write barriers, that is not the whole story (maybe another day). My gut feeling is that Falcon is over-inlining in this instance.

Oh Shenandoah

For Shenandoah I grabbed one of the builds provided by the benevolent Shipilev here. The build I got is this one: build 1.8.0-internal-jenkins_2017_11_12_03_35-b00
Running with the same heap size, but remembering that these builds don't default to Shenandoah:
Shenandoah (-XX:+UseShenandoahGC):
offersFailed |    0.031 ± 0.024  ops/us
pollsFailed  |    0.009 ± 0.025  ops/us
pollsMade    |  143.165 ± 3.172  ops/us

Note that for Shenandoah there's not a massive imbalance between the offer/poll sides, which indicates the issue is not pathological to one method or the other. For G1GC the problem was very clearly on the offer side. I had a peek at the assembly, but it's going to take me some time to get to grips with what's going on, as it's a completely new set of tricks and quirks to look at. To get a high level view though, it's interesting to compare the HW counters for ParallelGC/Zing/G1/Shenandoah:

                      | PGC     |  Zing    | G1      | Shenandoah
pollsMade             | 368.055 |  303.039 | 194.538 | 140.236
CPI                   |   0.244 |    0.222 |   0.352 |   0.355 
cycles                |   7.629 |    9.319 |  14.764 |  19.998 
instructions          |  31.218 |   41.972 |  42.002 |  56.294 
branches              |   6.052 |   11.672 |   7.735 |   9.017 
L1-dcache-loads       |  10.057 |   10.936 |  14.320 |  26.707 
L1-dcache-load-misses |   0.067 |    0.149 |   0.162 |   0.100 
L1-dcache-stores      |   3.553 |    3.042 |   4.057 |   5.064 

  • This is a tiny workload, with 30-50 instructions per operation. I want to make it clear that it is very easy to have large differences between JVMs in such specific scenarios. This workload is all about loading and storing references in/out of an array. The data is all L1 resident; this is NOT A REPRESENTATIVE COMPARISON OF PERFORMANCE. If you want to know how these JVMs/GCs can help your application, run a representative workload with your application.
  • Seriously, let's not start a "my JVM/GC is better than thou" war here, OK people?
  • For simplicity of comparison I've run PGC/G1/Shenandoah out of the Shenandoah build that includes all 3. This makes for a simpler comparison as they all share the same compiler, but is not comparing with the relevant Oracle build (though it should be pretty much the same).
  • Zing has a better CPI than PGC, but 10 more instructions per operation. These include 1 extra load and 6 more branches. There are no branch misses in this workload, so the branches are just extra pressure on the branch predictor and more instructions. The 10 instruction difference translates into a 1.6 cycle difference, which is indicative of the success of the branch predictor in reducing the impact of the branches. These extra branches are the Zing LVB, or read barrier. Each reference load costs an extra branch. Zing is doing fewer stores here, due to its defensive approach to card marking.
  • G1 is given here with the good case (same region), as we already covered the bad case. We see G1 increasing the loads by 4, but using fewer branches than Zing, only 2 extra branches. These are related to the write barrier. We also see an extra store.
  • Shenandoah is using 3 extra branches and 16 extra loads. This is more than I expected. Since Shenandoah is using a Brooks pointer you would expect an extra load for each reference load. If we estimate from the Zing branch increase that we have 6 reference loads, I'd expect to see 6 extra loads on the Shenandoah side. I assume the other loads are related to card marking, but I will need to learn more about this collector to say anything. Shipilev has expanded on my rudimentary analysis here. His conclusion: "read and write barriers around Unsafe intrinsics are very active. C2 handling on Unsafe intrinsics uses CPUOrder membars a lot (http://hg.openjdk.java.net/shenandoah/jdk10/file/1819ee64325f/src/hotspot/share/opto/library_call.cpp#l2631), which may inhibit some barrier optimizations. The workload is also tied up in a very unlucky volatile-predicated loop that prevents barrier hoisting.
    Pending codegeneration improvements alleviate barrier costs even when they are not optimized."
Since I fully expect all the crusaders to get hot and bothered about the above I thought I'd throw in...


A Holy Graal!

Graal is the next gen compiler for HotSpot. Coming out of Oracle Labs and already running in Twitter production, I thought we should add another compiler dimension to this mix. To run Graal you can use: "-XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler"
Comes included in the Java 9 package!!! So how does it do?

Oracle9.0.1 (-XX:+UseParallelGC -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler)

offersFailed |      ≈ 0          ops/us
pollsFailed  |    0.347 ± 0.441  ops/us
pollsMade    |   51.657 ± 3.568  ops/us

WHHHHHHHHYYYYYYYYYYYY!!!!!!!!
Why can't we have nice things? Well... It turns out the good folks at Oracle Labs have not yet implemented putOrdered as nicely as C2 and Falcon have, and have thus completely buggered up my SPSC queue :(


Summary: There Are Many JVMs In My Father's House

Variety is the spice of life, as they say. In the narrow, narrow use case presented above we saw different JVMs/GCs/JITs exhibiting all sorts of different behaviours. Some good, some less so. For this workload there's very little happening: no GC, no exciting vectorization opportunities, it's very limited. But being limited has the benefit of simplicity and the opportunity to contrast.
Also note that by profiling and refining this code on C2 I have perhaps overfitted it to one compiler at the expense of others; I certainly had more opportunity to eliminate issues in the C2+ParallelGC scenario...

I hope you enjoyed the tour. I encourage you to take these new friends home and play with them ;-).