<h2>How Inlined Code Makes For Confusing Profiles</h2>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiVUc_EYxyYHOsM3-SjxYotCla33ZP1gq969ba7EoLE9CEhpIXTjFnP1nDY1tvVKD4J4qZQScMGpxCJiDfzb4evd1r0Rk90RxceWQ6R8XkVElITAyKl6zekTXIZ1MU85tBCrbWe9hTnaA/s1600/confusion.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="675" data-original-width="1024" height="210" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiVUc_EYxyYHOsM3-SjxYotCla33ZP1gq969ba7EoLE9CEhpIXTjFnP1nDY1tvVKD4J4qZQScMGpxCJiDfzb4evd1r0Rk90RxceWQ6R8XkVElITAyKl6zekTXIZ1MU85tBCrbWe9hTnaA/s320/confusion.jpg" width="320" /></a>Inlining is a powerful and common optimization technique used by compilers. <br />
But inlining combines with other optimizations to transform your code to such an extent that other tooling, namely profilers, becomes confused.<br />
<div>
<h3>
Mommy, What is Inlining?</h3>
<a href="https://en.wikipedia.org/wiki/Inline_expansion" target="_blank">Inlining</a> is the mother of all optimizations (to quote C. Click), and mothers are awesome as we all know. Inlining is simply the practice of replacing a method call with the method body. This doesn't sound like much of an improvement, but it's potentially a massive win for a couple of reasons:<br />
<ul>
<li>Expanding the scope of other optimizations! Calling a method with a constant parameter may, for instance, end up getting folded into a constant (see the toy example after this list). Most compilers take the view that you either know everything about a method, because it's the method you are compiling ATM, or you know next to nothing about it. Calling another method may end up blocking, having some unknown side effect, or bringing the world to an end. Inlining expands the horizons of all available optimizations into the inlined code. These benefits can be quite dramatic, eliminating much of the inlined code (callee) on the one hand and assisting in supporting assumptions about the compilation unit at hand (caller) on the other.</li>
<li>For very small methods the call overhead may actually exceed the cost of the method itself. Importantly, and counter-intuitively perhaps, the cost of calling a method is not fixed. The cost will depend, for example, on the calling convention (whose job is it to keep track of registers?) and on pre/post conditions (e.g. ordering and safepoints).</li>
</ul>
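To see the first point in action, here's a toy illustration (hand-written source purely for illustration; the real transformation happens on the compiler's intermediate representation, not on your .java files):<br />
<pre>class InlineToy {
  static int square(int x) { return x * x; }

  static int area() { return square(8); }  // a call with a constant argument

  // After inlining square() into area() the compiler effectively sees:
  //   static int area() { return 8 * 8; }
  // ...which constant-folds into:
  //   static int area() { return 64; }    // no call, no multiply left
}</pre>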
But inlining is not without its potential issues:<br />
<ul>
<li>"Code explosion": if I have a method of size 128b and that method cannot be reduced in any form by the context of its caller, every time I inline it I increase my code size by 128b (a bit less, as I can remove the calling related prologue/epilogue). This may lead to a very bloated application, which will put pressure on my instruction cache as well as the JVM code cache. </li>
<li>Increased CPU pressure from the compiler itself: being challenged with larger and larger compilation units will lead to higher compile times, increased CPU usage, delayed compilation etc.</li>
<li>Increasingly complex compilation units may result in less optimal code than simpler ones. This is a real world limitation, though I would think in theory the more context the compiler has to work with, and assuming an infinite amount of cpu/memory/monkeys to throw at the problem, the better overall result we should get. In practice I have seen several cases where limiting inlining can improve results.</li>
</ul>
Picking the right boundaries is challenging, but luckily we got them bright compiler engineers to worry about that. <br />
<br />
<h3>
How Much Inlining Can We Do?</h3>
<div>
The amount of inlining a compiler does will depend on many factors. I'm not going to get very far if I try and explain the heuristics involved in great detail but some basic dimensions to the problem are:</div>
<div>
<ul>
<li>Who is the callee? In Java almost all method calls are 'virtual'. This means that at runtime any number of implementations could possibly exist for a particular call. This is traditionally where you'd give up on inlining. To make inlining happen anyway the compiler must convert a virtual call into a static call before it can be inlined (a rough sketch of such a guarded conversion follows this list). The same code in a different runtime may induce different compiler decisions on this. JIT compilation allows the compiler to tailor the decision in a way that AOT compilation cannot (in a dynamic environment). </li>
<li>How big is the callee? Most compilers put some upper limit on how big a caller method can grow as well as how big can a callee method be and still get inlined. A very large caller will be blocked from further inlining and even a medium sized callee might be blocked from being inlined. What is considered big will depend on the compiler, configuration and the profile. Limitations on compilation time for JIT compilers also drive this decision, at least partially.</li>
<li>How deep is the already inlined stack? How many veils can the compiler pierce? What about recursion?</li>
</ul>
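As a rough sketch of what converting a virtual call into a static call amounts to (expressed as hand-written Java purely for illustration; the names are made up, and the JIT applies the guard to compiled code based on the observed type profile):<br />
<pre>interface Callee { int compute(int v); }

class ObservedImpl implements Callee {
  public int compute(int v) { return v + 42; }
}

class DevirtSketch {
  // The original virtual call: callee.compute(v)
  static int guardedCall(Callee callee, int v) {
    if (callee.getClass() == ObservedImpl.class) {
      return v + 42;              // inlined body of ObservedImpl.compute(v)
    } else {
      return callee.compute(v);   // fall back to the virtual call (or deoptimize)
    }
  }
}</pre>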
So, that's the general shape of the problem: Virtual x Size x Depth. And if you were gonna try and write an inlining strategy for a compiler you'd need a lot more details, but we can keep going with this for now.</div>
<div>
<br /></div>
<h3>
How Much Inlining Does Java Actually Do?</h3>
<div>
And by Java I mean the Oracle JDK 8u161. These things change whimsically from run to run, release to release, JVM to JVM etc, so stick a big YMMV on this shit.<br />
We can have the JVM helpfully share with us compilation details via a bunch of flags, but most importantly for this feature we will need: <span style="font-family: "courier new" , "courier" , monospace;">-XX:+PrintInlining -XX:+PrintCompilation</span> (output is a bit hard to reason about because concurrent logging...). <br />
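A minimal way to produce such a log looks something like the line below (the jar name is a placeholder; note that <span style="font-family: "courier new" , "courier" , monospace;">-XX:+PrintInlining</span> is a diagnostic flag in HotSpot, so it needs unlocking):<br />
<span style="font-family: "courier new" , "courier" , monospace;">java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining -jar your-benchmarks.jar > compilation.log</span><br />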
<span style="font-family: inherit;">For the sake of demonstration, here's the output of the offer method of SpscArrayQueue from a throughput benchmark I run regularly for JCTools:</span></div>
<script src="https://gist.github.com/nitsanw/c104c92287ea785c8d30325ba18157dc.js"></script>
<br />
Some observations:<br />
<ul>
<li>Tiny methods and some intrinsics are inlined even by the early compilation.</li>
<li>The <i>offerSlowPath</i> method, which is not inlined by the first compilation, is inlined by the C2 compiler on a second pass. </li>
</ul>
This is not a particularly challenging piece of code, but I picked it because it is easy enough for a human to follow. Small methods get inlined, even if they are nested a bit in other tiny methods.<br />
There's a performance related nugget to dig here:<br />
<ul>
<li>I split the <i>offerSlowPath</i> method with the intent of not having it inlined (or at least leaving this decision to the compiler). The compiler decided it is better off inlined. Who's right? Has the compiler defeated my puny human brain? Am I as worthless as my mother-in-law suspects? These questions will be answered some other day.</li>
</ul>
<h3>
Where Do Profilers Come In? </h3>
In traditional Java profilers no attempt is made at distinguishing between inlined frames (calls to methods which have been inlined) and 'real' frames (real method calls on the stack). Telling the 2 apart can be useful:</div>
<div>
<ul>
<li>Inlining is often a good thing, so failure to inline a method can be worth looking into. It's also instructive to see just how much the compiler can inline, and how many layers of distraction it must cut through to get to where the actual code you want to run is. Excessive "Manager/Abstract/Adapter/Proxy/Wrapper/Impl/Dimple/Pimpl" layering however can still prevent inlining and may also lead to erectile dysfunction, early onset dementia and being knifed by your colleagues.</li>
<li>Top of stack hot methods which are not inlined present obvious curios. Perhaps callers are too large, or the method itself?</li>
<li>Multiple implementors of the same method which are not inlined indicate a megamorphic callsite. If one of these is significantly hotter than the rest you might be able to special case for it (peel off the callsite with an <i>instanceof</i> check, effectively rolling your own conditional inlining; see the sketch after this list).</li>
</ul>
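A minimal sketch of such callsite peeling (the Shape/Circle names are made up for illustration):<br />
<pre>interface Shape { double area(); }

final class Circle implements Shape {
  double r;
  public double area() { return Math.PI * r * r; }
}

class PeelSketch {
  static double totalArea(Shape[] shapes) {
    double sum = 0;
    for (Shape s : shapes) {
      if (s instanceof Circle) {
        sum += ((Circle) s).area(); // monomorphic callsite, easily inlined
      } else {
        sum += s.area();            // the (possibly megamorphic) virtual call remains here
      }
    }
    return sum;
  }
}</pre>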
But missing out on some good opportunities is not the issue I'm concerned with in this post. The point I'd like to get across is that <b>Inlining makes an already confusing reality much much more confusing</b>.</div>
<div>
<br /></div>
<div>
<h3>
A Confusing Reality</h3>
CPUs execute instructions, these instructions are not our code but the product of a series of compilation and optimization steps. It follows that if we hope to learn where the bottleneck in our code/application is we must be able to translate back from CPU instructions (the product of C1/C2/Other JIT compilers) to bytecode (the product of javac/JVM language compiler/bytecode generating agent/Other) to Code coordinates (file + line or class + method).<br />
This is not a straightforward mapping because:<br />
<ul>
<li>Some instructions do not map to any bytecode: this can happen when the JVM generates code for its own accounting purposes.</li>
<li>Some instructions map to many bytecodes: the compiler can be quite clever in substituting many operations with one specialized instruction, or the result of one computation can be reused elsewhere.</li>
<li>Some bytecodes do not map to Code: this can happen due to bytecode generation.</li>
<li>Sometimes the compiler just fails in its bookkeeping :-(. </li>
</ul>
A modern JIT compiler employs a significant number of optimizations to any given method, allowing it to deliver us the speed we love, but making reverse engineering the instructions into code a challenge. The mapping rules employed by profiling tools trying to bridge this gap, and by the compilers trying to give them a usable index, boil down to:</div>
<div>
<ul>
<li>We assume instruction X translates to at most 1 bytecode location.</li>
<li>When a mapping does not exist, find the nearest instruction Y which has a mapping and use that.</li>
</ul>
This is already confusing: the imperfect mapping, coupled with the compiler's license to reorder code, can make the "nearest instruction" be "not the nearest line of code".<br />
The problem is made worse by the inaccuracy of the profiler observations themselves. It is a known issue, largely unsolved, that the reported instruction from which the mapping starts is often a few (in theory many) instructions away from the intended sample location. This can have a knock-on effect on the ultimately blamed/mapped code. This issue was touched on in a <a href="http://psy-lob-saw.blogspot.com/2016/06/the-pros-and-cons-of-agct.html" target="_blank">previous post on AGCT profilers</a>.<br />
When looking at a single method this is often a pesky, but ultimately manageable problem. Sure, the profiler pointed to line A, but any fool can see the problem is line B... silly little hobbit.<br />
<br />
<h3>
MOAR, BIGGER, CONFUSION</h3>
So, given that:</div>
<div>
<ul>
<li>The instruction reported is a few instructions off.</li>
<li>The instructions do not map perfectly to lines of code.</li>
<li>The order of instructions does not have to match the order of the code.</li>
</ul>
We can see how even a small inaccuracy can translate into a large distance in lines of code between the cost we are trying to profile and where we end up. This is made exponentially worse by inlining.</div>
<div>
Where we had one method, we now have many methods, from many classes. As the code gets pulled in it is subjected to the same compiler "remixing", but at an expanded horizon. For instance, a computation repeated in the caller and callee can now be done once. Considering that inlining 4-6 layers of indirection is very common, the "LOC distance" between "near instructions" can end up being very confusing.</div>
<div>
I should really elaborate with a clear example, but I do not have a repeatable experiment to hand and have already stalled with this article for a few months while distracted elsewhere. An interesting example of how this could work out is available <a href="https://github.com/elastic/elasticsearch/issues/26339#issuecomment-328079161" target="_blank">here</a>. In this issue we have the wrong instruction being blamed for the cost of a cache-miss in a loop. The cache miss originates in the bowels of an inlined LinkedHashMap::get. The nearest instruction with a mapping, however, is from Field::binaryValue (line 441); it's easy to see how this would end up a very misleading profile if considered from a code perspective. When observed at the assembly level the instruction slip is much easier (as much as reading assembly is easy) to see through, but profiling on the assembly level is something very few people do and very few tools support.<br />
<br />
<h3>
Summary</h3>
Inlining is good, and profiling is good. But profiling modern compiled inlined code presents us challenges which are not easily solvable by looking at the common output of most tools. The more sense the compiler makes of our code, its context, its execution tree, the less obvious the mapping between instructions and code.<br />
So, what can we do? I think we should have our tools approach the problem in 2 ways:<br />
<ol>
<li>Fuzzier profiles: What would happen if instead of assigning cost to a single instruction (and its code coordinates) we distributed the cost on a range of instructions (see the sketch after this list)? We could have more or less sophisticated models, but even a naive bell curve around the sampled instruction would bring in more context to the profile, and may help. This is hypothetical, as I am not aware of any tools which attempt this ATM.</li>
<li>Differentiate between real/inlined frames: This is something the JVM is certainly capable of doing, but most APIs for profiling do not support it. AFAIK only <a href="http://www.oracle.com/technetwork/server-storage/developerstudio/downloads/index.html" target="_blank">Oracle Studio</a> and perf/bcc + <a href="https://github.com/jvm-profiling-tools/perf-map-agent" target="_blank">perf-map-agent</a> support views on real vs. inlined. I tend to use the latter more than the former. There's talk of adding <a href="https://github.com/jvm-profiling-tools/async-profiler/issues/66" target="_blank">support for this into async-profiler</a>.</li>
</ol>
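To illustrate the first idea (purely hypothetical, as noted above; the names and the window/spread parameters are made up), attribution might look something like this, smearing each sample over a window of instructions with a bell-curve weight instead of blaming a single one:<br />
<pre>class FuzzyAttributionSketch {
  // costPerInstruction is indexed by instruction position within the compiled method.
  static void attributeSample(int sampledIndex, double[] costPerInstruction) {
    int radius = 3;      // arbitrary window size, for illustration only
    double sigma = 1.5;  // arbitrary spread
    double[] weights = new double[2 * radius + 1];
    double total = 0;
    for (int d = -radius; d <= radius; d++) {
      weights[d + radius] = Math.exp(-(d * d) / (2 * sigma * sigma));
      total += weights[d + radius];
    }
    for (int d = -radius; d <= radius; d++) {
      int i = sampledIndex + d;
      if (i >= 0 && i < costPerInstruction.length) {
        costPerInstruction[i] += weights[d + radius] / total; // each sample contributes ~1 in total
      }
    }
  }
}</pre>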
<br />
Many thanks to <a href="https://twitter.com/VladimirSitnikv" target="_blank">V. Sitnikov</a> for reviewing :-).<br />
<br />
This post is one of a few on the topic of profiling, check the other ones out:<br />
<ul>
<li><a class="K3JSBVB-c-g" href="http://psy-lob-saw.blogspot.com/2016/02/why-most-sampling-java-profilers-are.html">Why (Most) Sampling Java Profilers Are Fucking Terrible</a></li>
<li><a class="K3JSBVB-c-g" href="http://psy-lob-saw.blogspot.com/2016/06/the-pros-and-cons-of-agct.html">The Pros and Cons of AsyncGetCallTrace Profilers</a> </li>
<li><a class="K3JSBVB-c-g" href="http://psy-lob-saw.blogspot.com/2017/02/flamegraphs-intro-fire-for-everyone.html">Java Flame Graphs Introduction: Fire For Everyone!</a></li>
<li><a href="http://psy-lob-saw.blogspot.com/2015/07/jmh-perfasm.html">JMH perfasm explained: Looking at False Sharing on Conditional Inlining</a> </li>
</ul>
</div>
<h2>What a difference a JVM makes?</h2>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEif9hyphenhyphenAXdekVmYk9zg3ramxrfrtSOkuK69ToapHARi4uxRlSEcoflDEJIDL8rdZgbmY8waU7uZsb018aQ0JRvuiamudrUIBRMLDomsq0abkKriIw-P27iVUK4UGjKyAKnEihEKgbwTkRUs/s1600/the_good__the_bad_and_the_ugly_by_felonland-d46b6az.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="1125" data-original-width="900" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEif9hyphenhyphenAXdekVmYk9zg3ramxrfrtSOkuK69ToapHARi4uxRlSEcoflDEJIDL8rdZgbmY8waU7uZsb018aQ0JRvuiamudrUIBRMLDomsq0abkKriIw-P27iVUK4UGjKyAKnEihEKgbwTkRUs/s200/the_good__the_bad_and_the_ugly_by_felonland-d46b6az.png" width="160" /></a>JDK 9 is out! But as a library writer, this means change, and change can go either way... Once we've satisfied ourselves that JCTools works with JDK9, what other observations can we make? Well, one of the main motivations for using JCTools is performance, and since the code has been predominantly tested and run with JDK8, is it even better with JDK9? Is it worse?<br />
<br />
<h3>
A JVM gets into trouble</h3>
I started my comparison (and we will not cover anything else cause the fuckers broke early) with the simplest queue, the SpscArrayQueue:<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ taskset -c 4-7 java -jar jctools-benchmarks/target/microbenchmarks.jar throughput.QueueThroughputBackoffNone -p qType=SpscArrayQueue -p qCapacity=131072 -jvmArgs="-Xmx1g -Xms1g" -i 5 -wi 15 -r 5 -w 1 -f 3</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">Oracle8u144:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">offersFailed | 0.020 ± 0.020 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsFailed | 0.093 ± 0.102 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsMade | 361.161 ± 4.126 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">Oracle9.0.1:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">offersFailed | 0.065 ± 0.269 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsFailed | 5.987 ± 2.788 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsMade | 26.182 ± 2.273 ops/us</span><br />
<div>
<br /></div>
<div>
Some explanations on method and results:<br />
<ul>
<li>This is running on my beefy laptop, lots of memory to spare, Xeon(R) CPU E3-1505M v6 @ 3.00GHz. I set the gov'nor to "userspace" and frequency to 2.8GHz to avoid CPU frequency scaling and turbo boosting while benchmarking. I use taskset above to pin the JVM process to 1 logical core on each physical core, so no 2 threads share a core. Easier than disabling HT, and sufficient for this exploration.</li>
<li>The <span style="font-family: "courier new" , "courier" , monospace;">QueueThroughputBackoffNone</span> benchmark is an all-out throughput benchmark for queues where producers and consumers spin to offer/poll (<a href="http://psy-lob-saw.blogspot.com/2014/07/poll-me-maybe.html" target="_blank">with j.u.Queue semantics</a>). Failures are recorded and the throughput observed is the rate of successful polls per <b>microsecond</b> (so millions per second if you prefer that figure). The benchmark is run with a single producer and single consumer thread, and is known to be sensitive to producer/consumer speed balance, as contending on an empty/full queue can lead to degraded performance. <a href="http://psy-lob-saw.blogspot.com/2016/12/linked-array-queues-part-2-spsc.html" target="_blank">See some discussion of the benchmark here</a>. A rough sketch of the measurement loop follows after this list.</li>
</ul>
</div>
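For readers unfamiliar with the benchmark, here is a rough, non-JMH sketch of what the two threads do and what the counters above mean (this is NOT the actual JCTools benchmark code; a j.u.c queue stands in for the JCTools one):<br />
<pre>import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.LongAdder;

public class ThroughputSketch {
  static volatile boolean running = true;
  static final LongAdder offersFailed = new LongAdder();
  static final LongAdder pollsFailed = new LongAdder();
  static final LongAdder pollsMade = new LongAdder();

  public static void main(String[] args) throws InterruptedException {
    final Queue&lt;Integer&gt; queue = new ArrayBlockingQueue&lt;&gt;(131072); // stand-in for SpscArrayQueue
    Thread producer = new Thread(() -> {
      while (running) {
        if (!queue.offer(1)) offersFailed.increment(); // queue was full
      }
    });
    Thread consumer = new Thread(() -> {
      while (running) {
        if (queue.poll() == null) pollsFailed.increment(); // queue was empty
        else pollsMade.increment();
      }
    });
    producer.start(); consumer.start();
    Thread.sleep(5000);
    running = false;
    producer.join(); consumer.join();
    System.out.printf("pollsMade: %.3f ops/us%n", pollsMade.sum() / 5_000_000.0);
  }
}</pre>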
<div>
<div>
<span style="font-family: inherit;">Back to the results. They are not good :(</span><br />
<span style="font-family: inherit;">Why would this happen? HOW COULD THIS HAPPEN!!!</span></div>
<div>
<span style="font-family: inherit;">I profiled this miserable little bastard only to find that </span><span style="font-family: inherit;">on the offer side there's a previously unobserved bottleneck:</span></div>
</div>
<div>
<span style="font-family: inherit;"><script src="https://gist.github.com/nitsanw/732b691191c32576d5a441638007ce96.js"></script></span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<div>
What's happening? The whole point of SPSC is that there's no need for a strong memory barrier, no need for lock add or CAS, just some careful ordering (<a href="http://psy-lob-saw.blogspot.co.za/2016/12/what-is-lazyset-putordered.html" target="_blank">using putOrdered/lazySet</a>). But here we got this LOCK ADDL (line 34) instruction spoiling all the fun and eating all the cycles (but the next line is getting all the blame, typical).<br />
Where did it come from?<b> I didn't put it there. </b>Note that this is add 0 to the base of the stack(-0x40), which is how a StoreLoad barrier is implemented (<a href="https://shipilev.net/blog/2014/on-the-fence-with-dependencies/" target="_blank">see interesting post on barrier implementation details</a>). But there's no volatile store in sight.</div>
</div>
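To illustrate the distinction (a toy snippet, not the JCTools code): on x86 a volatile store is followed by a StoreLoad barrier (the lock addl above), while an ordered store is a plain MOV that only guarantees store-store ordering:<br />
<pre>import java.util.concurrent.atomic.AtomicLong;

class StoreFlavours {
  final AtomicLong index = new AtomicLong();

  void volatileStore() { index.set(1L); }     // mov + lock addl (StoreLoad) on x86
  void orderedStore()  { index.lazySet(1L); } // plain mov, no StoreLoad fence
}</pre>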
<div>
<div>
<br />
<h3>
A JVM gets out of trouble</h3>
</div>
<div>
TBH, even before bothering to look at the assembly I reached for a big "usual suspect" for performance differences with JDK9: G1GC is the new default GC!</div>
<div>
I quickly verified this theory (that G1GC fucked this up for me) by re-running the same benchmark on JDK9 with the JDK8 default GC (-XX:+UseParallelGC). Got similar results to JDK8. Awesome-ish:<br />
<span style="font-family: "courier new" , "courier" , monospace;">Oracle9.0.1 (-XX:+UseParallelGC)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">offersFailed | 0.059 ± 0.133 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsFailed | 0.147 ± 0.251 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsMade | 356.100 ± 6.305 ops/us</span><br />
<div>
<br /></div>
<div>
The bottleneck I've hit? It's G1GC<a href="http://psy-lob-saw.blogspot.co.za/2014/10/the-jvm-write-barrier-card-marking.html" target="_blank"> card marking! I even wrote a post about it.</a> The G1GC write barrier is quite different from CMS/Parallel. You'll notice that the barrier discussed there is a little different from the one we see here. Times, they are a changing...</div>
<div>
So... the G1GC write barrier is ANGRY. Why so angry?</div>
</div>
<div>
<script src="https://gist.github.com/nitsanw/c25e6e3969ac1b1575da71a79802035a.js"></script></div>
<div>
<br /></div>
<div>
What this means is that the <span style="background-color: white; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">g1_write_barrier_post </span><span style="font-family: inherit;">is pissed because (a simplified sketch of these checks follows the list):</span></span><br />
<ul>
<li><span style="white-space: pre-wrap;">buffer (the backing array for the queue) and element (being offered) are from different regions</span></li>
<li><span style="white-space: pre-wrap;">element is not null</span></li>
<li><span style="white-space: pre-wrap;">card (for the buffer) is not young</span></li>
</ul>
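In heavily simplified pseudo-Java, the fast-path checks look roughly like this (the constants and names are illustrative, not HotSpot's actual code); only when all three checks fail does the slow path run, which is where the fence, card dirtying and enqueueing for refinement happen:<br />
<pre>class G1PostBarrierSketch {
  static final int LOG_REGION_SIZE = 20; // 1mb regions, as with a 1g heap
  static final int CARD_SHIFT = 9;       // 512 byte cards
  static final byte YOUNG_CARD = 2;      // illustrative value
  static byte[] cardTable;

  static void postWriteBarrier(long fieldAddr, long newValAddr) {
    if (((fieldAddr ^ newValAddr) >>> LOG_REGION_SIZE) == 0) return; // same region? done
    if (newValAddr == 0) return;                                     // stored a null? done
    if (cardTable[(int) (fieldAddr >>> CARD_SHIFT)] == YOUNG_CARD) return; // card is young? done
    // slow path: StoreLoad fence, dirty the card, enqueue it for concurrent refinement
  }
}</pre>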
<span style="white-space: pre-wrap;">Confusingly, when playing around with this issue I moved from a small machine (8gb) to a bigger one (32gb) and when running with a larger heap size specified the issue became a lot less pronounced. This is because if we set the size of the heap to 1g we get 1mb regions. </span><span style="white-space: pre-wrap;">If we set the size of the heap to 8g (set both mx and ms) however we get 4mb regions.</span><span style="white-space: pre-wrap;"> We can demonstrate this is the issue by running again with G1 and setting the region size:</span><br />
<span style="font-family: "courier new" , "courier" , monospace; white-space: pre-wrap;">Oracle9.0.1 (-XX:+UseG1GC -Xms1g -Xmx1g -XX:G1HeapRegionSize=4m)
offersFailed | 0.009 ± 0.033 ops/us
pollsFailed | 0.252 ± 0.257 ops/us
pollsMade | 183.827 ± 16.650 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace; white-space: pre-wrap;"><br /></span></div>
<div>
<span style="background-color: white; white-space: pre-wrap;"><span style="font-family: inherit;">So, still not so brilliant, but much better. This however implies that the improvement is due to the buffer and the element being allocated from the same region. This is purely a product of the benchmark and the chosen queue size, and is not typical of normal applications. It follows therefore (I think) that the behaviour we see here is quite likely to manifest pathologically for writes into long-standing data structures, such as caches and queues. And indeed if we increase the queue capacity the situation reverts.</span></span><br />
All this messing around with JDK9 vs 8, runtime and GC impact comparison got me thinking of a couple of other regional collectors which might exhibit interesting behaviours here. Namely, Zing C4 and Shenandoah.<br />
<h3>
</h3>
<h3>
Put some Zing in that thing</h3>
I no longer work at Azul, so I had to get an <a href="https://www.azul.com/zing-oss-open-source-developer-access/" target="_blank">Open Source developer licence</a>, which was a breeze. I had to get some help with getting it all to work on Ubuntu 17.10, but 16.04 works off the bat (just follow the instructions). Starting with same parameters I got:<br />
<span style="font-family: "courier new" , "courier" , monospace;">Zing8u17.12:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">offersFailed | 0.016 ± 0.010 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsFailed | 0.013 ± 0.022 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsMade | 302.288 ± 3.602 ops/us</span><br />
<br /></div>
<div>
So Zing is 20% slower than Oracle ParallelGC here, but significantly better than G1 (either case). The JMH perfasm profiler does not work with Zing, though Azul does have a matching tool if you ask their support. To look at the profile I can either use ZVision, or Oracle Studio. I went with the latter, just because.<br />
The profile is hard to read, so I might go into it another time, but the question that seems obvious is: "Is Zing slower than Oracle+ParallelGC because of read/write barrier costs?"<br />
Zing after all is not a GC add-on to OpenJDK, but a completely different runtime+GC+compiler. In particular, Zing has recently switched from its old C2-like compiler (they forked paths many moons ago, but share a parent in Cliff Click) to an LLVM-based compiler, Falcon, which is now the default. Testing that quickly by forcing Zing to use the C2 compiler yields the following results:<br />
<span style="font-family: "courier new" , "courier" , monospace;">Zing8u17.12 (-XX:+UseC2):</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">offersFailed | 0.034 ± 0.055 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsFailed | 0.010 ± 0.017 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsMade | 198.067 ± 31.983 ops/us</span></div>
<div>
<br />
OK, so Falcon is a big win for Zing here, that's not the issue. Can we take the read and write barriers out of the picture?<br />
Sure we can! Zing supports 2 exciting GCs: you may have heard of the C4 GC, but Zing also supports NoGC (it's very Zen), which is exactly what it sounds like. Running with no GC however may remove some positive effects GC has (e.g. not crashing when you've allocated more than your heap size and never collected, but also the compaction and relocation of data), so we need to run NoGC with barriers, and NoGC with no barriers:<br />
<span style="font-family: "courier new" , "courier" , monospace;">Zing8u17.12 (-XX:+GPGCNoGC -XX:+UseSVBs -XX:+UseLVBs):</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">offersFailed | 0.035 ± 0.053 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsFailed | 0.022 ± 0.043 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsMade | 302.433 ± 2.675 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: inherit;">So, turning GC off makes no difference at all for this benchmark. That's great, as we can consider the removal of the barriers in isolation:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">Zing8u17.12 (-XX:+GPGCNoGC -XX:-UseSVBs -XX:-UseLVBs):</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">offersFailed | 0.099 ± 0.070 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsFailed | 0.027 ± 0.048 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsMade | 314.498 ± 17.872 ops/us</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Note that we see:</span><br />
<ol>
<li><span style="font-family: inherit;">Some improvement when barriers are removed.</span></li>
<li><span style="font-family: inherit;">Increased variance in results. This was due to large run to run variance which requires further digging.</span></li>
</ol>
So, while we can certainly see a difference here which is due to read/write barriers, that is not the whole story (maybe another day). My gut feeling is that Falcon is over-inlining in this instance.<br />
<br />
<h3>
Oh Shenandoah</h3>
<div>
For Shenandoah I grabbed <a href="https://builds.shipilev.net/openjdk-shenandoah-jdk9/" target="_blank">one of the builds provided by the benevolent Shipilev here</a>. The build I got is this one: <span style="font-family: "courier new" , "courier" , monospace;">build 1.8.0-internal-jenkins_2017_11_12_03_35-b00</span></div>
<div>
Running with same heap size, but remembering that these builds don't default to Shenandoah:</div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">Shenandoah (-XX:+UseShenandoahGC):</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">offersFailed | 0.031 ± 0.024 ops/us</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">pollsFailed | 0.009 ± 0.025 ops/us</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">pollsMade | 143.165 ± 3.172 ops/us</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: inherit;">Note that for Shenandoah there's not a massive imbalance between the offer/poll side, which indicates the issue is not pathological to one method or the other. For G1GC the problem was very clearly on the offer side. I had a peek at the assembly, but it's going to take me some time to get to grips with what's going on as it's a completely new set of tricks and quirks to look at. To get a high level view though, it's interesting to compare the HW counters for ParallelGC/Zing/G1/Shenandoah:</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> | PGC | Zing | G1 | Shenandoah</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">pollsMade | 368.055 | 303.039 | 194.538 | 140.236</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">CPI | 0.244 | 0.222 | 0.352 | 0.355 </span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">cycles | 7.629 | 9.319 | 14.764 | 19.998 </span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">instructions | 31.218 | 41.972 | 42.002 | 56.294 </span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">branches | 6.052 | 11.672 | 7.735 | 9.017 </span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">L1-dcache-loads | 10.057 | 10.936 | 14.320 | 26.707 </span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">L1-dcache-load-misses | 0.067 | 0.149 | 0.162 | 0.100 </span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">L1-dcache-stores | 3.553 | 3.042 | 4.057 | 5.064 </span></div>
<div>
<br /></div>
</div>
<div>
<ul>
<li><span style="font-family: inherit;">This is a tiny workload, with 30-50 instructions per operation. I want to make it clear that it is very easy to have large differences between JVMs in such specific scenarios. This workload is all about loading and storing references in/out of an array. The data is all L1 resident, this is NOT A REPRESENTATIVE COMPARISON OF PERFORMANCE. If you want to know how these JVMs/GCs can help your application, run a </span>representative<span style="font-family: inherit;"> workload with your application.</span></li>
<li><span style="font-family: inherit;">Seriously, let's not start a "my JVM/GC is better than thou" war here, OK people?</span></li>
<li><span style="font-family: inherit;">For simplicity of comparison I've run PGC/G1/Shenandoah out of the Shenandoah build that includes all 3. This makes for a simpler comparison as they all share the same compiler, but is not comparing with the relevant Oracle build (though it should be pretty much the same).</span></li>
<li>Zing has better CPI than PGC, but 10 more instructions per operation. These include 1 extra load, and 6 more branches. There are no branch misses in this workload, so the branches are just extra pressure on the branch predictor and more instructions. The 10 instructions difference translates into a 1.6 cycle difference; this is indicative of the success of the branch predictor in reducing the impact of the branches. These extra branches are the Zing LVB or read barrier. Each reference load costs an extra branch. Zing is doing fewer stores here; this is due to its defensive approach to card marking.</li>
<li>G1 is given here with the good case (same region), as we already covered the bad case. We see G1 increasing the loads by 4 but using fewer branches than Zing, only 2 extra branches. These are related to the write barrier. We also see an extra store.</li>
<li>Shenandoah is using 3 extra instructions, and 16 extra loads. This is more than I expected. Since Shenandoah is using a Brooks-Pointer you would expect an extra load for each reference load. If we estimate from the Zing branch increase that we have 6 reference loads, I'd expect to have 6 extra loads on the Shenandoah side. I assume the other loads are related to card marking but I will need to learn more about this collector to say anything. <a href="http://cr.openjdk.java.net/~shade/shenandoah/jctools-QueueThroughputBackoffNone.txt" target="_blank">Shipilev has expanded on my rudimentary analysis here</a>. His conclusion: <span style="font-family: Times, Times New Roman, serif;"><u>"read and write barriers around Unsafe intrinsics are very active. C2 handling on Unsafe intrinsics uses CPUOrder membars a lot (http://hg.openjdk.java.net/shenandoah/jdk10/file/1819ee64325f/src/hotspot/share/opto/library_call.cpp#l2631),which may inhibit some barrier optimizations. The workload is also tied up in a very unlucky volatile-predicated loop that prevents barrier hoisting.</u></span><pre style="white-space: pre-wrap; word-wrap: break-word;"><span style="font-family: Times, Times New Roman, serif;"><u>Pending codegeneration improvements alleviate barrier costs even when they are not optimized."</u></span></pre>
</li>
</ul>
</div>
<span style="font-family: inherit;">Since I fully expect all the crusaders to get hot and bothered about the above I thought I'd throw in...</span><br />
<h3>
<span style="font-family: inherit;"><br /></span></h3>
<h3>
<span style="font-family: inherit;">A Holy Graal!</span></h3>
<div>
<span style="font-family: inherit;">Graal is the next gen compiler for HotSpot. Coming out of Oracle Labs and already running in Twitter production, I thought we should add another compiler dimension to this mix. To run Graal you can use: </span><span style="font-family: "courier new" , "courier" , monospace;">"-XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler"</span></div>
<div>
<span style="background-color: white;"><span style="font-family: inherit;">Comes included in the Java 9 package!!! So how does it do?</span></span></div>
<div>
<span style="background-color: white;"></span><br />
<div>
<span style="background-color: white;"><span style="font-family: "courier new" , "courier" , monospace;">Oracle9.0.1 (-XX:+UseParallelGC -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler)</span></span></div>
<span style="background-color: white;">
</span>
<br />
<div>
<span style="background-color: white;"><span style="font-family: "courier new" , "courier" , monospace;">offersFailed | ≈ 0 ops/us</span></span></div>
<span style="background-color: white;">
</span>
<div>
<span style="background-color: white;"><span style="font-family: "courier new" , "courier" , monospace;">pollsFailed | 0.347 ± 0.441 ops/us</span></span></div>
<span style="background-color: white;">
<div>
<span style="font-family: "courier new" , "courier" , monospace;">pollsMade | 51.657 ± 3.568 ops/us</span></div>
<div style="font-family: inherit;">
<br /></div>
<div>
<span style="font-family: inherit;">WHHHHHHHHYYYYYYYYYYYY!!!!!!!!</span></div>
<div>
<span style="font-family: inherit;"><script src="https://gist.github.com/nitsanw/db29b0193ac92136e0ce01e727bc6c1a.js"></script></span></div>
<div>
<span style="font-family: inherit;">Why can't we have nice things? Well... It turns out the good folks at Oracle Labs have not yet implemented putOrdered as nicely as C2 and Falcon have, and have thus completely buggered up my SPSC queue :(</span><br />
<span style="font-family: inherit;"><br /></span>
<br />
<h3>
<span style="font-family: inherit;">Summary: There Are Many JVMs In My Father's House</span></h3>
</div>
</span></div>
Variety is the spice of life as they say. In the narrow narrow usecase presented above we saw different JVMs/GCs/JITs throwing up all different behaviours. Some good, some less so. For this workload there's very little happening, no GC, no exciting vectorization opportunities, it's very limited. But, being limited has the benefit of simplicity and the opportunity to contrast.<br />
Also note that by profiling and refining this code on C2 I have perhaps overfitted it to one compiler at the expense of others; I certainly had more opportunity to eliminate any issues for the C2+ParallelGC scenario....<br />
<br />
I hope you enjoyed the tour, and I encourage you to take these new friends home and play with them ;-).</div>
</div>
<h2>Java Flame Graphs Introduction: Fire For Everyone!</h2>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZBUQecR1L1vNVpko_1rao3eynN3zVSvAs-1QGYK-wxjRGP_jWDesLrhvby6FKZFusuKm7jmQhLaOYGWMJeUirblmZiAl-Nev_1H7ByvlGwLpCCSif_bDGUkjCNE9ySez-k7y1SEv-SbE/s1600/fire-heart-961194_1920.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZBUQecR1L1vNVpko_1rao3eynN3zVSvAs-1QGYK-wxjRGP_jWDesLrhvby6FKZFusuKm7jmQhLaOYGWMJeUirblmZiAl-Nev_1H7ByvlGwLpCCSif_bDGUkjCNE9ySez-k7y1SEv-SbE/s320/fire-heart-961194_1920.jpg" width="320" /></a><a href="https://github.com/brendangregg/FlameGraph" target="_blank">FlameGraphs</a> are superawesome. If you've never heard of FlameGraphs and want to dive straight into the deep end, you should run off and check out the many many good resources provided by <a href="https://twitter.com/brendangregg" target="_blank">Brendan Gregg</a> in his <a href="http://www.brendangregg.com/flamegraphs.html" target="_blank">one stop shop page here</a>. This post will give a quick intro and some samples to get you started with collecting profiles for all JVMs everywhere. I'm taking a slightly different tack than Brendan in presenting the topic, so if it turns out my explanations suck you should see if his make more sense.<br />
<br />
<h3>
What's the big deal?</h3>
<div>
If you've ever used a profiler to look at your code you will typically have seen 2 kinds of profile report:</div>
<div>
<ol>
<li>Flat profile: This is often presented as the "top X" methods/classes/packages where <i>time </i>(or samples, or ticks or whatever) is spent. This is useful as it immediately shows up common bottlenecks across your code, but these are shown out of context. Sometimes this is enough, but often in larger application profiles context is significant. This representation is very useful when a method with a high overall impact is called from many callsites, making each callsite cheap but the method itself significant.</li>
<li>Tree profile: This profile will present you with a call tree where each method is a node with a total and self <i>time</i> quantity. The self measure is the amount of time spent in the method itself (the amount of samples in which the method is the leaf), and total is the total number of samples in which it shows up (leaf and node).</li>
</ol>
The problem with the tree view is that it is very unpleasant to navigate. Click click click click and as the stack deepens it becomes harder to look at and ingest. Enter FlameGraphs.</div>
<div>
FlameGraph represents a tree profile in a single interactive SVG where:</div>
<blockquote class="tr_bq">
<span style="background-color: white; text-align: justify;"><span style="font-family: "courier new" , "courier" , monospace;"><b>The x-axis shows the stack profile population, sorted alphabetically (it is not the passage of time), and the y-axis shows stack depth. Each rectangle represents a stack frame. The wider a frame is, the more often it was present in the stacks. The top edge shows what is on-CPU, and beneath it is its ancestry.</b></span></span></blockquote>
Like most visualisations, it makes sense when you see it rather than when it's explained. Let's start with data sets we can easily grasp and see what they look like.<br />
<br />
<h3>
Synthetic Samples For Starters</h3>
For instance, what does a single stack sample look like? The FlameGraphs SVG generating script takes as its input a "collapsed stacks" file which has a dead simple format: frames separated by semi-colons, followed by the number of times this stack was sampled. Here's a dummy handwritten example of a single sample file (call it single-stack.cstk):<br />
<span style="background-color: #cccccc;"><span style="font-family: "courier new" , "courier" , monospace;">main;0;1;2;3;4;5;6;7;8;9;10 1</span></span><br />
<br />
We can feed this to the flames (now is a good time to clone this repo and try shit out):<br />
<span style="background-color: #eeeeee; font-family: "courier new" , "courier" , monospace;">flamegraph.pl single-stack.cstk > single-stack.svg</span><br />
<br />
Here's what a single stack trace looks like:<br />
<object data="https://drive.google.com/uc?export=view&id=0B4jDFCGuuSqBTFFGMEp6UllGN28" type="image/svg+xml">
Please Use modern Browser(e.g. recent chrome?) to see this SVG!
</object>
<br />
<br />
But a single stack trace is just one data point, not a profile. What if we had 1M samples of this same stack?<br />
<object data="https://drive.google.com/uc?export=view&id=0B4jDFCGuuSqBbTNEX1R3eTJ0VG8" type="image/svg+xml">
Please Use modern Browser(e.g. recent chrome?) to see this SVG!
</object>
<br />
<br />
Well... it would look pretty much the same, but if you hover over it, it will tell you it got 1M samples. It looks the same because we still have 100% the same stack for the whole profile. It's the same profile.<br />
"<b>BUT!</b>" I hear you say, "But, colours?". Yes, the colors mean nothing at this point, but will become interesting later. The default colour palette is red and the choice of colors is random, hence the colour selection changes from run to run. Just forget colors for a second, OK?<br />
Right, next we want to look at a set of samples with a few more stacks:<br />
<span style="background-color: #cccccc; font-family: "courier new" , "courier" , monospace;">
main;0;1;2;3;4;5 1<br />
main;0;1;2;3;4;5;6 2<br />
main;0;1;2;3;4;5;6;7 3<br />
main;0;1;2;3;4;5;6;7;8 4<br />
main;0;1;2;3;4;5;6;7;8;9 5</span><br />
<br />
<object data="https://drive.google.com/uc?export=view&id=0B4jDFCGuuSqBYnIzRWVrUmJ6eE0" type="image/svg+xml">
Please Use modern Browser(e.g. recent chrome?) to see this SVG!
</object>
<br />
<br />
Now you can also get a feel for what clicking around does and how you zoom in and out.<br />
By now I hope you get the picture for how a bunch of stacks and frequencies look with simple data sets. The last synthetic example to look at has several root frames and a little more varied stacks. Let's try this:<br />
<span style="background-color: #cccccc; font-family: "courier new" , "courier" , monospace;">
main;a;1;2;3;4;5 1<br />main;c;1;2;3;4;5;6;7 4<br />main;a;1;2;3;4;5;6 2<br />main;c;1;2;3;4;5;6 4<br />
main;c;1;2;3;4;5;6;8 4<br />
main;b;1;2;3;4;5;6;7 3<br />
main;b;1;2;3;4;5;6;8 3<br />
main;b;1;2;3;4;5;6;9 3<br />
main;d;1;2;3;4;5;6;7;8;9 5</span><br />
<br />
And we get this:<br />
<object data="https://drive.google.com/uc?export=view&id=0B4jDFCGuuSqBaHhGQXpaTmJFWk0" type="image/svg+xml">
Please Use modern Browser(e.g. recent chrome?) to see this SVG!
</object>
<br />
<div>
<span style="background-color: #cccccc; font-family: "courier new" , "courier" , monospace;"><br /></span></div>
We see here that stacks are sorted alphabetically and ordered from left to right. The ordering has nothing to do with the order in the file. The collapsed stacks format is itself an aggregation with no view on timing. So the order from left to right is only about merging, not time or anything else. We can see that stacks which share a common parent naturally aggregate under that parent. The width of each frame is its relative <i>total-time</i> share. Its <i>self-time</i> share is its top exposure, or how much of it is not covered by its callees, the frames on top of it.<br />
<br />
<h3>
Tree View vs Flames</h3>
Now that we got the hang of this flamy thing, let's take a look at the same profile using 2 presentations: the venerated tree-view and this new hipsterish whatever flame thing. The following is a profile collected using <a href="https://github.com/RichardWarburton/honest-profiler" target="_blank">honest-profiler</a> for a <a href="https://github.com/netty/netty" target="_blank">netty</a> benchmark:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlxf9fn2rFjnpSQah87KjQGcGCRMz9e7ZmAUtwZN1uU8M3Jizb7fQWu5pEmUEpkCkAZ15Se7qGj6yFzAmDY0hvbha0IMTenV4pAw8gnG3vLqEOynwOwfSNum2-gnXEjeI-cPJEzCtHLDs/s1600/Screenshot+from+2017-02-10+17-36-33.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="411" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlxf9fn2rFjnpSQah87KjQGcGCRMz9e7ZmAUtwZN1uU8M3Jizb7fQWu5pEmUEpkCkAZ15Se7qGj6yFzAmDY0hvbha0IMTenV4pAw8gnG3vLqEOynwOwfSNum2-gnXEjeI-cPJEzCtHLDs/s640/Screenshot+from+2017-02-10+17-36-33.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
In a typical workflow I step further and further into the hot stack, but this pushes the big picture out of my view. I would now have to go back up and fold it to see what hides under other hot branches in the tree. It's a familiar and annoying experience if you've ever used a profiler with this kind of view. The problem is that Java class and method names are typically long, and stacks are quite deep. This is a simple application and I quickly run out of room.<br />
Here's the FlameGraph for the same profile (I chose green, because later it makes sense):<br />
<object data="https://drive.google.com/uc?export=view&id=0B4jDFCGuuSqBaTh6XzJtN19xWHM" type="image/svg+xml">
Please Use modern Browser(e.g. recent chrome?) to see this SVG!
</object>
<br />
<br />
<b><i>NOTE: I made all the flame graphs in this post narrow so they fit the layout. They don't have to be this narrow. You can set the width to whatever you like, I used "--width=700" for the graphs in this post.</i></b><br />
We can see the root frames quickly break out to main activities, with the deep netty stack now visible upfront. We can click and zoom easily. I also find the search ability, which colors frames matching a string, useful for highlighting class/package participation in the profile. Prominent flat-tops indicate hot leaf methods we might want to look at.<br />
It's perhaps a matter of taste, but I love it. I've been using flame graphs for a while and they are pure genius IMO. I find the image itself is intuitively more approachable, and with the ability to quickly zoom in/out and search I can usually quickly work through a profile without losing sight of the full picture.<br />
So how do you get one?<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJbSIVGsSzvQ6XyhSbLOW92-9XGZgU_tP77ps-Jzk8mHUB2IsPR2Q9T8nuE9P-cBGWwjcfyuXpS8FmaxCbkKyKh7mMpIlOUtzAJ87br_ThpU2k1fuEAX5Jl2QzgbYJB_etTFKzOKV4bcw/s1600/bob-wont-hurt-you.gif" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="239" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJbSIVGsSzvQ6XyhSbLOW92-9XGZgU_tP77ps-Jzk8mHUB2IsPR2Q9T8nuE9P-cBGWwjcfyuXpS8FmaxCbkKyKh7mMpIlOUtzAJ87br_ThpU2k1fuEAX5Jl2QzgbYJB_etTFKzOKV4bcw/s320/bob-wont-hurt-you.gif" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">It's Bob! yay?</td></tr>
</tbody></table>
<h3>
Everybody Gets A FlameGraph!</h3>
<div>
Yes, even you poor suckers running JDK 1.3 on Windows XP! I don't <a href="http://psy-lob-saw.blogspot.co.za/2016/02/why-most-sampling-java-profilers-are.html" target="_blank">recommend this method of profiling</a> if you have a more recent JVM, or if your JVM supports AsyncGetCallTrace, but if your deployment is stuck in the past you can still be in on this. This is because ALL JVMs must support JVMTI and AFAIK allow you to hit them with jstack/JVisualVM/hprof. It's a terrible way to profile, there's a lot of overhead, and usually you can find a better way, but this is universally available. Collecting a sample via jstack is (a terrible idea) quite easy. Just find the <i>pid</i> of the process you want to profile using <i style="font-weight: bold;">jps </i>and then do something like:<br />
<span style="font-family: "courier new" , "courier" , monospace;">for i in {1..100}; do</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> jstack <pid> >> iloveoldshit.jstk;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> sleep 0.1;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">done</span></div>
<div>
And Bob is your relative (which is a good thing apparently).</div>
<div>
Once you've collected a large enough sample for your application you can go on and feed flame graphs:</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">cat </span><span style="font-family: "courier new" , "courier" , monospace;">iloveoldshit</span><span style="font-family: "courier new" , "courier" , monospace;">.jstk | ./stackcollapse-jstack.pl | ./flamegraph.pl --color=green > jstack-flames.svg</span></div>
<div>
And you get:</div>
<object data="https://drive.google.com/uc?export=view&id=0B4jDFCGuuSqBX1l4aG5OY3BaUG8" type="image/svg+xml">
Please Use modern Browser(e.g. recent chrome?) to see this SVG!
</object>
<br />
<div>
<br /></div>
<div>
This is the same benchmark from before, but a different profile with the safepoint bias. You can compare the two by scrolling up and down. OR you can use FlameGraphs to diff the 2, in a moment.<br />
FlameGraphs supports converting <b>jstack </b>output into collapsed stacks (as above). Efforts exist on GitHub to convert the <b>hprof </b>format which JVisualVM produces (as well as other profilers) into collapsed stack format.<br />
So for all my poor readers, struggling under the oppression of JVMs for which better means to spill out stack samples do not exist, I got your backs, you too can be looking at flame graphs!<br />
But seriously, just move on.<br />
<br />
<h3>
Level Up: AsyncGetCallTrace or JFR</h3>
</div>
<h3>
</h3>
Now, if you want a better profiler, which does not result in bringing your application to safepoint and pausing ALL your threads at each sample, AND you are either running a 1.6 or higher JDK (OpenJDK/Oracle/recent Zing) on Linux/Mac/BSD, you can use <a href="https://github.com/RichardWarburton/honest-profiler" target="_blank">Honest-Profiler</a> to collect your profile. If you got Oracle JDK 1.7u40 or higher you can use Java Flight Recorder (if you use it in production you need to pay for the licence). These <a href="http://psy-lob-saw.blogspot.co.za/2016/06/the-pros-and-cons-of-agct.html" target="_blank">profilers rely on AsyncGetCallTrace</a> to record the Java stack safely on interrupt (from a signal handler, not at safepoint).<br />
To collect with Honest-Profiler I start my JVM with the following parameter:<br />
<span style="font-family: "courier new" , "courier" , monospace;">-agentpath:$HONEST_PROFILER_HOME/liblagent.so=host=localhost,port=4242,logPath=$PWD/netty.hpl</span><br />
Then, when I feel the time is right, I can start and stop the profile collection<br />
<span style="font-family: "courier new" , "courier" , monospace;">echo start | nc localhost 4242</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">echo stop | nc localhost 4242</span><br />
<span style="font-family: inherit;">To convert the binary format into collapsed stacks I need to use a helper class in the honest-profiler.jar:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">java -cp $HONEST_PROFILER_HOME/honest-profiler.jar com.insightfullogic.honest_profiler.ports.console.FlameGraphDumperApplication netty.hpl netty.cstk</span><br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZvIyiln8DgC7XvB5-rLbTwiMeklsWR8bVSSrRL4d1_XRkQcTPS1eAe4qnh2kV1yr_ospTQ8oHHKp_ElQKsRNM38cmrqQKpas5ioCzxq2w6xSnuj-p7aKzLujKUd0DDCSRta2SA5V69xI/s1600/larry.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="113" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZvIyiln8DgC7XvB5-rLbTwiMeklsWR8bVSSrRL4d1_XRkQcTPS1eAe4qnh2kV1yr_ospTQ8oHHKp_ElQKsRNM38cmrqQKpas5ioCzxq2w6xSnuj-p7aKzLujKUd0DDCSRta2SA5V69xI/s200/larry.jpg" width="200" /></a><span style="font-family: inherit;">I can then feed the flamegraph script the collapsed stacks file and get the result which we've already seen.</span><br />
<span style="font-family: inherit;">To <a href="http://isuru-perera.blogspot.co.za/2015/09/java-mixed-mode-flame-graphs.html" target="_blank">convert JFR recordings to flame graphs see this post</a>. But remember children, you must pay Oracle if you use it in production, or Uncle Larry might come for you.</span><br />
<span style="font-family: inherit;"><br /></span>
<br />
<h3>
<span style="font-family: inherit;">Bonus: Diff Profiles!</span></h3>
<div>
A nice benefit of having 2 different profilers produce (via some massaging) a unified format for flame graphs is that we can now diff profiles from 2 profilers. Not something that is generally useful, granted. But diffing profiles is an established capability in many profilers, and is usually used to do a before/after comparison. Flame Graphs support this via the same visualization. Once you have converted your profiles into the collapsed stacks format you can produce a diff file and graph it:</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">./difffolded.pl -n A.cstk B.cstk | ./flamegraph.pl > A-B-diff.svg</span></div>
<div>
Diffing the Honest-Profiler and jstack profiles gives us the following:</div>
<div>
<object data="https://drive.google.com/uc?export=view&id=0B4jDFCGuuSqBY3JuSHBqU3U4ZEE" type="image/svg+xml">
Please use a modern browser (e.g. recent Chrome) to see this SVG!
</object> </div>
<div>
<br />
The white squares are not interesting, the red/pink squares highlight the <i>delta of self samples as a percentage of total samples </i>(not so intuitive)<i>.</i> I admit it may seem a tad confusing at first, but at least it draws your eyes to the right places. <a href="http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html" target="_blank">More on differential flame graphs here</a>.<br />
<b>Note</b>:<i> to make this diff work I had to shave off the thread names from the jstack collected collapsed stacks file.</i><br />
<br />
<h3>
Further Bonus: Icicle Graph</h3>
</div>
<div>
Sometimes the bottleneck is not a particular call stack plateau, but rather a particular method being called from many call sites. This kind of bottleneck will not show well in a flame graph, as the different stacks with similar tops will be split and may not stand out. This is really where a flat profile is great, but we can also flip the flame graph view to highlight the merging of top methods:</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">cat netty.cstk | ./flamegraph.pl --reverse --invert --color=green > netty-icicles.svg</span></div>
<div>
<span style="font-family: inherit;">I've filtered the stacks to show only a relevant portion:</span></div>
<div>
<object data="https://drive.google.com/uc?export=view&id=0B4jDFCGuuSqBY2ZyQ3pSU0g5a28" type="image/svg+xml">
Please use a modern browser (e.g. recent Chrome) to see this SVG!
</object><br />
This is not dissimilar to the functionality offered by other profiler GUIs, which let you drill down from hot methods into their callers in a tree view.<br />
<br />
<h3>
Level UP++: Java Perf Flame Graphs FTW!</h3>
</div>
<div>
If you are so fortunate as to:</div>
<div>
<ol>
<li>Be running OpenJDK/Oracle 1.8u60 or later (this functionality is coming to Zing in a near-future release, fingers crossed)</li>
<li>Be running on Linux</li>
<li>Have the permissions to make this run</li>
</ol>
You can get amazing visibility into your system by using a combination of:</div>
<div>
<ol>
<li>Java with: -XX:+PreserveFramePointer (also recommended: -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints so that unfolded frames are more accurate)</li>
<li><a href="https://github.com/jrudolph/perf-map-agent/" target="_blank">perf-map-agent</a>: This attachable agent dumps a symbol mapping file for all runtime generated code in the JVM, enabling perf to correctly resolve addresses to methods. You'll need to clone and build.</li>
<li>perf: You'll need permissions and you'll need to install it. I assume you are root of your own machine for simplicity.</li>
</ol>
With the above working you can collect a perf profile of your Java process (by itself or as part of the whole system). This results in a <i>perf.data</i> file and a <i>perf-<pid>.map</i> file in your /tmp folder. You can then proceed to generate a collapsed stack profile from that file; the simplest way to get this going is to use a script packaged with <i>perf-map-agent</i>:</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">perf-java-flames <pid></span></div>
<div>
<span style="font-family: inherit;">This will ask for password as it needs to sudo a few things. Be the sudo you want to see in the world. After a suspenseful wait of 15 seconds you'll get this:</span><br />
<div>
<object data="https://drive.google.com/uc?export=view&id=0B4jDFCGuuSqBU2FWRWhRWFNPN2c" type="image/svg+xml">
Please use a modern browser (e.g. recent Chrome) to see this SVG!
</object>
</div>
<i style="font-family: inherit;">Note: Netty deploys it's native lib into tmp, loads it and deletes it, which means perf gets lost looking for it. I deleted it from the benchmarks jar and loaded it directly using LD_LIBRARY_PATH to resole this visibility issue. It doesn't make a huge difference, but in case you try this out.</i><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">The green frames are Java, and the rest are all the magic which happens to help Java along. Here's what happened:</span><br />
<br />
<ul>
<li>Red frames are C library or Kernel code. We can see that the socket writes in the original profile actually go into native land. We now have further visibility down the hole. Importantly, this illustrates where hot Java methods are in fact not Java methods at all, so looking at their Java code for optimisation opportunities is futile.</li>
<li>Yellow frames are C++. We can see the call chain leading into the interpreter and ultimately into compiled Java code.</li>
<li>Green frames are Java. BUT if you compare the previously presented profiles with this one you will notice there are some intermediate frames missing here. This is because the frames in this profile are "real" frames, or rather they map to actual stack frames. Inlined methods in Java do not have their own stack frames, so we can't see them (for now, we'll sort this out in a second). Furthermore, the keen observer will notice the familiar "Thread.run" bottom of the stack is missing, replaced by the "interpreter". As is often the case, the run method did not get compiled in this benchmark, so it is not a proper compiled method for which we have a mapping. Methods in the interpreter are opaque in this profile.</li>
<li>Some stacks are broken, which can be confusing. In the example above we can see the 13.8 unknown chunk which leads to some JVM activities, but also to some Java code. More on that later.</li>
</ul>
So, it would seem that we have gained visibility into the native/OS/JVM CPU utilization, but lost a lot of the information we had on the Java side. When is this still useful?<br />
<br />
<ul>
<li>This profile is super useful if you are writing Netty and trying to work out which system calls you end up with from your JNI code, or where time is spent in that code (Netty implements its own native epoll selector, very cool). If you are writing an application which utilizes JNI libraries this profile will give you visibility across the divide. The alternative here would be to use 2 profilers and try to correlate them. Solaris Studio also offers some help here; I will one day write a post on Solaris Studio.</li>
<li>This is not a good example of a profile dominated by JVM threads, but in many profiles the GC activity will show up. This is very useful, as GC and compiler CPU utilization can get in the way of application threads using the available CPU. A Java-only profiler leaves you to correlate GC/compilation logs with the application profile to figure out who ate the pie. It's also an interesting view into which part of the GC is to blame.</li>
<li>Some JVM <a href="https://en.wikipedia.org/wiki/Intrinsic_function" target="_blank">intrinsics</a> are confusing to AsyncGetCallTrace, and invisible to safepoint profilers. The biggest culprit I see is array copy. Array copies will show up as failed samples on AGCT profilers (unless, like JMC, they just fail to tell you about failed samples altogether). They show up in this profile (search above for <i>arraycopy</i>), but only a tiny bit.</li>
<li>This profile can be collected system wide, allowing you to present a much wider picture and expose machine wide issues. This is important when you are looking at machine level analysis of your application to improve configuration/setup.</li>
<li>In depth view of OS calls can inform your configuration.</li>
</ul>
<blockquote class="tr_bq">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIHggJt_981-ZaXR4fjQQNbVTxLGfJahGFE37oaKGyhQWxOw3w8cohM1ruIeD57QHL9BzBR4upIvXq88MfkYmKY9XdPtutonxEEaaEdqPMPKME4w_k9Pw2-eevYhHcEMoQrH1iccWkpgo/s1600/cat-in-hat.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIHggJt_981-ZaXR4fjQQNbVTxLGfJahGFE37oaKGyhQWxOw3w8cohM1ruIeD57QHL9BzBR4upIvXq88MfkYmKY9XdPtutonxEEaaEdqPMPKME4w_k9Pw2-eevYhHcEMoQrH1iccWkpgo/s320/cat-in-hat.jpg" width="224" /></a><br />
<blockquote class="tr_bq" style="text-align: center;">
<i>'look at me! </i><i>look at me now!' said the cat.</i></blockquote>
<blockquote class="tr_bq" style="text-align: center;">
<i>'with a cup and a cake </i><i>on the top of my hat!</i></blockquote>
<blockquote class="tr_bq" style="text-align: center;">
<i>I can hold up TWO books!</i></blockquote>
<blockquote class="tr_bq" style="text-align: center;">
<i>I can hold up the fish!</i></blockquote>
<blockquote class="tr_bq" style="text-align: center;">
<i>and a little toy ship!</i></blockquote>
<blockquote class="tr_bq" style="text-align: center;">
<i>and some milk on a dish!</i></blockquote>
<blockquote class="tr_bq" style="text-align: center;">
<i>and look!</i></blockquote>
<blockquote class="tr_bq" style="text-align: center;">
<i>I can hop up and down on the ball!</i></blockquote>
<blockquote class="tr_bq" style="text-align: center;">
<i>but that is not all!</i></blockquote>
<blockquote class="tr_bq" style="text-align: center;">
<i>oh, no. T</i><i>hat is not all...</i></blockquote>
</blockquote>
<br />
<br />
<h3>
Bonus: Inlined Frames! Threads! COLOR!</h3>
We can win back the inlined frames information by asking perf-map-agent to create a more detailed map file with inlining data. This leads to larger map files, but should be worth it.<br />
You can further tweak the command line to color kernel frames differently and control sample duration and frequency. And while we're a-tweakin', let's also have thread info.<br />
Here's what you run:</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PERF_COLLAPSE_OPTS="--kernel --tid" PERF_RECORD_FREQ=99 PERF_RECORD_SECONDS=10 PERF_MAP_OPTIONS=unfoldall perf-java-flames <pid></span></div>
<div>
And the graph now looks like this:<br />
<div>
<object data="https://drive.google.com/uc?export=view&id=0B4jDFCGuuSqBU04yZkU1cF9ZR2s" type="image/svg+xml">
Please use a modern browser (e.g. recent Chrome) to see this SVG!
</object>
</div>
<br />
<br />
<ul>
<li>The Java frames are now green and aqua. Aqua frames are inlined frames and green are "real". This information is not presented at all by most profilers, and is pretty inaccessible in others. Here we can instantly see some interesting inlining challenges in the tall towers of same-sized frames. The compiler inlines through many of them but perhaps eventually gives up; maybe there's something to be won by simplifying the abstraction here?</li>
<li>Thread id is added as a base frame. This is helpful in this particular example because there are only 2 interesting threads and I very much want to see this split. It also helps bring back some broken stacks into the fold. Now I can tell these frames belong to the Netty epoll thread. Yay.</li>
<li>Orange frames are kernel frames.</li>
<li>Having the thread id highlights that the non-Java frames on the left are from a variety of threads. If we had more GC/Compiler work happening this may become interesting.</li>
<li>When profiling a large application with thread pools, this separation by thread may not be what you want... but sometimes it is very helpful, like above. In this benchmark I have a thread generating load and a thread I want to profile, so telling them apart works great. At the moment there's no mapping of threads to thread names, but in the future we may be able to easily group thread pools for more meaningful views.</li>
</ul>
<h3>
Bonus: Hack It All Up</h3>
<div>
There's very little code in perf-map-agent, and the scripts would take you 5 minutes to read through. You don't have to use the scripts; you can write your own. You can add or enhance features; it's easy to participate or customize. Dig in, have fun :-)</div>
<div>
The FlameGraph scripts are nice and tidy, and the pipeline separation of [profile -> collapsed stacks -> graph] means you can read through them and tweak the bits you care about without worrying too much about the rest. While working on this post I played with the diff presentation a bit. It was my first ever interaction with Perl, and though I'm not so very bright, I managed to get what I wanted. Surely someone as fine as yourself can do better.</div>
<div>
If you look at Brendan's updates page you'll see many, many people are jumping in and tweaking and sharing and making funky things. Go for it!</div>
<div>
<br /></div>
<h3>
Summary And Credits</h3>
<div>
So you can have flame graphs, all of you. And you can feed these with inputs from several sources, each with its own set of pros and cons. It makes a great addition to your toolbox and may give you that extra perspective you are missing in your profiling.</div>
<div>
There are a few people who deserve a mention in the context of the tools above; look them up, they all helped make it happen:</div>
<div>
<ul>
<li><a href="http://brendangregg.com/" target="_blank">Brendan Gregg</a>: FlameGraphs proud daddy. Brendan has written allot about FlameGraphs, work through his posts and you'll learn plenty.</li>
<li><a href="https://github.com/jrudolph" target="_blank">Johannes Rudolph</a>: Author and maintainer of perf-map-agent.</li>
<li><a href="https://github.com/tjake" target="_blank">Jake Lucianni</a>: Contributed flamegraphs support for inlined frames.</li>
<li><a href="https://github.com/RichardWarburton" target="_blank">Richard Warburton</a>: Author and maintainer of honest-profiler.</li>
</ul>
Thanks for reading :-) Next up, some JVM profile analysis.</div>
</div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com2tag:blogger.com,1999:blog-5171098727364395242.post-39603964351224405042016-12-20T08:33:00.001+00:002016-12-20T20:10:04.806+00:00What do Atomic*::lazySet/Atomic*FieldUpdater::lazySet/Unsafe::putOrdered* actually mean?<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1hqLa0xDE0oy0GV-ibJfsNuM_NHXtmu8pdV_jVPtF7OU4kBxyPC2Z1ixkC9tOlVI5qk-o56Oh1t9ixx-uU18RJJfZ5w5sJDxhyphenhyphen8JvLKDAZInqhTk85M4k9q2BwVX9otbay_vB7efNaW4/s1600/ShelbyBlondell-RoadToHell.jpeg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1hqLa0xDE0oy0GV-ibJfsNuM_NHXtmu8pdV_jVPtF7OU4kBxyPC2Z1ixkC9tOlVI5qk-o56Oh1t9ixx-uU18RJJfZ5w5sJDxhyphenhyphen8JvLKDAZInqhTk85M4k9q2BwVX9otbay_vB7efNaW4/s200/ShelbyBlondell-RoadToHell.jpeg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Paved with well intended definitions it is.</td></tr>
</tbody></table>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">lazySet/putOrdered (or an <b>ordered store</b>) was added as a bit of a rushed/non-commital afterthought after the JMM was done, so it's description is subject to many debates on the mailing lists, stack overflow and watercoolers the world over. This post merely tries to provide a clear definition with references to relevant/reputable sources.</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">
</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Definition:</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><b>An ordered store is a <u>weakened volatile store[2][5]</u>. It prevents preceding stores and loads from being reordered with the store[1][3][6], but does not prevent subsequent stores and loads from being reordered with it[2][4].</b></span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">If there was a JMM cookbook entry for </span><b style="font-family: 'helvetica neue', arial, helvetica, sans-serif;">ordered store</b><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> defining it with barriers in mind it would seem that the consensus is that </span><b style="font-family: 'helvetica neue', arial, helvetica, sans-serif;">ordered stores</b><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> are preceded by a StoreStore AND a LoadStore barrier[4][6].</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><b style="font-family: 'helvetica neue', arial, helvetica, sans-serif;">Ordered store </b><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">is practically the same as a C++ memory_release_store[5][7].</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">
</span></pre>
<div style="text-align: right;">
</div>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">References:</span><span style="font-family: inherit;"><span style="font-family: inherit;">
</span><b><span style="font-family: "courier new" , "courier" , monospace;">[1] Original bug:</span></b></span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: inherit;"><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: inherit;">"</span><span style="font-family: "courier new" , "courier" , monospace; font-size: medium;">lazySet provides a preceding store-store barrier (which is either a no-op or very cheap on current platforms), but no store-load barrier</span><span style="font-family: inherit;">"</span></span><span style="font-family: inherit;">
</span><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">See here: <a href="http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6275329" style="color: #663366;">http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6275329</a>
</span></span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="width: 50em;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><span style="font-family: inherit; white-space: pre-wrap;">
</span></span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="width: 50em;"><span style="font-family: "courier new" , "courier" , monospace; white-space: pre-wrap;"><b>[2] java.util.concurrent docs</b>:</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="width: 50em;"><span style="font-family: inherit;"><span style="white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: inherit;">"</span><span style="font-family: "courier new" , "courier" , monospace; font-size: medium;">lazySet has the memory effects of writing (assigning) a volatile variable except that it permits reorderings with subsequent (but not previous) memory actions that do not themselves impose reordering constraints with ordinary non-volatile writes.</span><span style="font-family: inherit;">"</span></span><span style="font-family: inherit;">
</span><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">See here: </span></span><span style="color: #663366; font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif; white-space: pre-wrap;"><a href="https://docs.oracle.com/javase/8/docs/api/?java/util/concurrent/package-summary.html" style="color: #663366; white-space: pre-wrap;">https://docs.oracle.com/javase/8/docs/api/?java/util/concurrent/package-summary.html</a></span><span style="font-family: inherit; white-space: pre-wrap;"><span style="font-family: inherit;">
</span></span></span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="width: 50em;"><span style="font-family: inherit;">
</span></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJtrblVmAjzkvanbFYcF0QnCw0Jvoe-OCgdkrvYA_RRD6qdsmmCfsWZQijo7Tg0adCWMO-vGvGIibOWb1L6rqRSCUwvPvI8lHI2ydxO1TUykBj-CfGBdCStvfK3Xzq-PpAhofx11fzU0w/s1600/cookbook.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="212" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJtrblVmAjzkvanbFYcF0QnCw0Jvoe-OCgdkrvYA_RRD6qdsmmCfsWZQijo7Tg0adCWMO-vGvGIibOWb1L6rqRSCUwvPvI8lHI2ydxO1TUykBj-CfGBdCStvfK3Xzq-PpAhofx11fzU0w/s320/cookbook.jpg" width="320" /></a></div>
<pre class="bz_comment_text" id="comment_text_3" style="width: 50em;"><span style="font-family: inherit;"><span style="white-space: pre-wrap;"><b><span style="font-family: "courier new" , "courier" , monospace;">[3] JMM cookbook:</span></b><span style="font-family: inherit;"> </span><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Defining barriers meaning here: </span></span><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif; white-space: pre-wrap;"><a href="http://g.oswego.edu/dl/jmm/cookbook.html">http://g.oswego.edu/dl/jmm/cookbook.html</a></span></span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="width: 50em;"><b style="white-space: pre-wrap;"><span style="font-family: inherit;">
</span></b></pre>
<pre class="bz_comment_text" id="comment_text_3" style="width: 50em;"><span style="font-family: "courier new" , "courier" , monospace;"><b style="white-space: pre-wrap;">[4] concurrency-interest Q&A with Doug Lea, October 2011</b><span style="white-space: pre-wrap;">:</span><span style="white-space: pre-wrap;"><span style="font-family: inherit;">
</span></span></span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="width: 50em;"><span style="font-family: "courier new" , "courier" , monospace; white-space: pre-wrap;"><span style="font-family: inherit;">"[Ruslan]:</span><span style="font-family: "courier new" , "courier" , monospace; font-size: medium;">... If it happens (== we see spin-wait loop finished) -- does it mean,that all writes preceding lazySet are also done, committed, and visible to thread 2, which finished spin-wait loop?
</span></span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="width: 50em;"><span style="font-family: inherit;"><span style="white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace; font-size: medium;">[Doug]: Yes, although technically, you cannot show this by reference to the Synchronization Order in the current JLS.
...
lazySet basically has the properties of a TSO store</span><span style="font-family: inherit;">"</span></span><span style="font-family: inherit;">
</span><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">See here: </span></span><a href="http://cs.oswego.edu/pipermail/concurrency-interest/2011-October/008296.html" style="color: #663366; white-space: pre-wrap;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">http://cs.oswego.edu/pipermail/concurrency-interest/2011-October/008296.html</span></a><span style="white-space: pre-wrap;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">
</span><b style="font-family: inherit;"><u>
</u></b></span></span></span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="width: 50em;"><span style="font-family: inherit;"><span style="white-space: pre-wrap;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The discussion is REALLY worth reading, involving Hans, Doug, Vitaly, Ruslan and other such respectable members of this excellent mailing list. Go on, I'll wait.
The discussion on that thread concludes the following is required:</span><span style="font-family: inherit;">
</span></span><span style="white-space: pre-wrap;">...
<span style="font-family: inherit;">LoadStore + </span>StoreStore
st [Y],X // store X into memory address Y
...
</span><span style="white-space: pre-wrap;">
</span></span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Outcome: Stores before and after are now prevented from floating across the barrier. Loads before the barrier are also prevented from floating down. Later loads are free to float up. Note that <b>st</b> may in theory be delayed indefinitely, certainly other loads and stores are allowed to float up between it and the barrier.</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: inherit;">
</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "courier new" , "courier" , monospace;"><b>[5] concurrency-interest Q&A with Aleksey Shipilev, May 2016</b>:</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">"</span>putOrdered is a release in disguise, most of the C++11 std::atomic(...,</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "courier new" , "courier" , monospace;">mem_order_release) reasoning applies here."</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">"</span>acquire/release are the relaxations from the usual volatile</span></pre>
<pre style="white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">rules -- while producing happens-before-s, they drop from total
</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "courier new" , "courier" , monospace;">synchronization order, thus breaking sequential consistency."</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: inherit;">
</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">And adds some fine examples:</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "courier new" , "courier" , monospace;">"Safe publication still works:</span></pre>
<pre style="white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">
int x; volatile int y;
-----------------------------------------------------------------------
put(x, 1); | r1 = get{Acquire|Volatile}(y);
put{Release|Volatile}(y, 2); | r2 = get(x);
(r1, r2) = (2, 0) is forbidden.
But anything trickier that requires sequential consistency fails. IRIW
fails, because no consistent write order observed by all threads. Dekker
fails, because release stores followed by loads may or may not be
visible in program order:
volatile int x; volatile int y;
-----------------------------------------------------------------------
putRelease(x, 1); | putRelease(y, 1);
r1 = getAcquire(y); | r2 = getAcquire(x);
(r1, r2) = (0, 0) is allowed. Forbidden if all ops are volatile.
Safe construction still <b>does not work</b> (even for volatiles!):
A global;
-----------------------------------------------------------------------
A a = <alloc>; | A a = global;
put{Release|Volatile}(a.x, 1); | r1 = get{Acquire|Volatile}(a.x);
global = a; |
(r1) = (0) is allowed."</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">See here: <a href="http://cs.oswego.edu/pipermail/concurrency-interest/2016-May/015104.html">http://cs.oswego.edu/pipermail/concurrency-interest/2016-May/015104.html</a></span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">
</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><b><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">[6] </span>concurrency-interest Q&A with Aleksey Shipilev, March 2016:</span></b></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">"</span><span style="font-family: "courier new" , "courier" , monospace;">><i> int a, b;</i></span></pre>
<pre style="white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">><i>
</i>><i> boolean tryLock() {
</i>><i> UNSAFE.putOrdered(a, 1); // Declare intent.
</i>><i>
</i>><i> // No StoreLoad here as store is not volatile.
</i>><i>
</i>><i> if (UNSAFE.getVolatile(b) == 1)) {
</i>><i> // Reset intent and return false;
</i>><i> }
</i>><i>
</i>><i> return true;
</i>><i> }
</i>
Even in the naive barrier interpretation that usually gives stronger
answers, you have:
[LoadStore|StoreStore]
a = 1;
r1 = b;
</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "courier new" , "courier" , monospace;"> [LoadLoad|LoadStore]</span><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">"</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">See here: <a href="http://cs.oswego.edu/pipermail/concurrency-interest/2016-March/015037.html">http://cs.oswego.edu/pipermail/concurrency-interest/2016-March/015037.html</a></span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "courier new" , "courier" , monospace;">
</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><b><span style="font-family: "courier new" , "courier" , monospace;">[7] C++ memory_order_release definition:</span></b></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "courier new" , "courier" , monospace;">"<span style="background-color: white; line-height: 14.08px;">A store operation with this memory order performs the </span><i style="background-color: white; line-height: 14.08px;">release operation</i><span style="background-color: white; line-height: 14.08px;">: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable (see </span><a href="http://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_ordering" style="background: none rgb(255, 255, 255); color: #0b0080; line-height: 14.08px; text-decoration: none;">Release-Acquire ordering</a><span style="background-color: white; line-height: 14.08px;"> below) and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic (see </span><a href="http://en.cppreference.com/w/cpp/atomic/memory_order#Release-Consume_ordering" style="background: none rgb(255, 255, 255); color: #0b0080; line-height: 14.08px; text-decoration: none;">Release-Consume ordering</a><span style="background-color: white; line-height: 14.08px;"> below).</span>"</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">See here: <a href="http://en.cppreference.com/w/cpp/atomic/memory_order">http://en.cppreference.com/w/cpp/atomic/memory_order</a></span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Many thanks to <a href="https://twitter.com/shipilev" target="_blank">A. Shipilev</a>, <a href="https://twitter.com/mjpt777" target="_blank">M. Thompson</a>, <a href="https://twitter.com/switchology" target="_blank">D. Lawrie</a> and <a href="https://twitter.com/dj_begemot" target="_blank">C. Ruslan</a> for reviewing, any remaining errors are their own and they shall be most severely reprimanded for them.</span></pre>
<pre class="bz_comment_text" id="comment_text_3" style="white-space: pre-wrap; width: 50em;"></pre>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com9tag:blogger.com,1999:blog-5171098727364395242.post-62480795744545761642016-12-13T09:36:00.001+00:002016-12-13T10:21:35.315+00:00Linked Array Queues, part 2: SPSC Benchmarks<a href="https://github.com/JCTools/JCTools" target="_blank">JCTools</a> has a bunch of <a href="https://github.com/JCTools/JCTools/tree/master/jctools-benchmarks" target="_blank">benchmarks</a> we use to stress test the queues and evaluate optimizations. <br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdNqXTDmp1OGBN4R4Z6Yjzy4aUt2xaGQV3ujwui7P26mHMlesF4vkVOK7RzHINoDjH_sYkJF2A01pXxvB3OKZaZJJDzQ8_GgnUeNJdHIdD6yVxGslupIMnOb2euOFg2RIW3rvHJvVa89I/s1600/fire+lab3.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdNqXTDmp1OGBN4R4Z6Yjzy4aUt2xaGQV3ujwui7P26mHMlesF4vkVOK7RzHINoDjH_sYkJF2A01pXxvB3OKZaZJJDzQ8_GgnUeNJdHIdD6yVxGslupIMnOb2euOFg2RIW3rvHJvVa89I/s1600/fire+lab3.jpg" /></a></div>
These are of course not 'real' workloads, but serve to highlight imperfections and opportunities. While it is true that an optimization might work in a benchmark but not in the real world, a benchmark can work as a demonstration that there are at least circumstances in which it does work. All measurement is imperfect, but not as imperfect as claims made with no fucking evidence whatsoever, so here goes.<br />
<div>
How do these linked-array queues fare in the benchmarks? what can we learn here?</div>
<div>
The <a href="http://psy-lob-saw.blogspot.com/2016/10/linked-array-queues-part-1-spsc.html" target="_blank">linked array queues</a> are a hybrid of the array and linked queues. So it seems reasonable that we should compare them to both SpscArrayQueue and SpscLinkedQueue. We should also consider how the queues differ and see if we can flush out the differences via the benchmarks.<br />
If you crack under the pressure of boring details, skip to the summary, do not stop at interlude, do not collect a cool drink or get praise, just be on yer fuckin' merry way.<br />
<br />
<h3>
Setup:</h3>
Benchmarks are run on a quiet server class machine:
<br />
<ul>
<li>Xeon processor(Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz): 2 CPUs x 12 cores x 2 threads (HT)</li>
<li>CentOS</li>
<li>Oracle JDK8u101</li>
<li>All benchmarks are run taskset to cores on the same numa node, but such that threads cannot share the same physical core.</li>
<li>Turbo boost is off, the scaling governor is userspace and the frequency is fixed.</li>
<li>The <a href="https://github.com/JCTools/JCTools/tree/master/jctools-core/src/main/java/org/jctools/queues" target="_blank">code is on github</a></li>
</ul>
</div>
<h3>
Throughput benchmark: background and method</h3>
<div>
A throughput benchmark for queues is a tricky fucker. In particular the results change meaning depending on the balance between consumer and producer:</div>
<div>
<ul>
<li>If the consumer is faster than the producer we are measuring empty queue contention (producer/consumer hitting the same cache line for elements in the queue, perhaps sampling each other's index). Empty queues are the expected state for responsive applications.</li>
<li>If the producer is faster than the consumer we are measuring full queue contention, which may have similar issues. For some queues which optimize for the healthy assumption that queues are mostly empty this may be a particularly bad place to be.</li>
<li>If the producer and consumer are well balanced we are testing a streaming use case which offers the most opportunities for progress for both consumer and producer. This should yield the best performance, but for most applications <b>may not be a realistic scenario at all</b>.</li>
</ul>
</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGRiHSD1GqoSX91m6qZV7UVFZilU0RCl8nHZSHoN8R2RuahLwY2VThMU7sHX6DSH1viSFnDyVU_7qiv0YeYtJCsdkLp2jumcyd78A-HR-DkENrt9xanlZ8I-OO8tYLRCKBjKAv0H390p0/s1600/lab+fire.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGRiHSD1GqoSX91m6qZV7UVFZilU0RCl8nHZSHoN8R2RuahLwY2VThMU7sHX6DSH1viSFnDyVU_7qiv0YeYtJCsdkLp2jumcyd78A-HR-DkENrt9xanlZ8I-OO8tYLRCKBjKAv0H390p0/s1600/lab+fire.jpg" /></a></div>
The JCTools throughput benchmark <b>does not resolve these issues</b>. It does however report results which give us an idea of poll/offer failure rates which are in turn indicative of which state we find ourselves in.<br />
A further challenge in managed runtime environments, which is unrelated to queues, is that garbage generating benchmarks will have GC state accumulate across measurement iterations. The implication is that each iteration is measuring from a different starting state. Naturally occurring GCs will leave the heap in varying states depending on the point at which they hit. We can choose to either embrace the noise in the measurement as an averaging of the cost/overhead of garbage or allocate a large enough heap to accommodate a single iteration worth of allocation and force a full GC per iteration, thus resetting the state per iteration. <b>The benchmarks below were run with 8g heap and a GC cycle between iterations.</b><br />
The <a href="https://github.com/JCTools/JCTools/blob/master/jctools-benchmarks/src/main/java/org/jctools/jmh/throughput/QueueThroughputBackoffNone.java" target="_blank">benchmark I run here is the no backoff version of the throughput benchmark</a> where failure to offer/poll makes no attempt at waiting/yielding/tapping of foot and just tries again straight away. This serves to maximize contention and is not a recipe for happiness in real applications.<br />
JMH parameters common to all runs below:
<br />
<ul>
<li>-gc true -> GC cycle between iterations</li>
<li>-jvmArgs="-Xmx8g -Xms8g" -> 8g heap</li>
<li>-i 10 -r 1 -> 10 measurement iterations, 1 second each</li>
<li>-wi 20 -w 1 -> 20 warmup iterations, 1 second each</li>
<li>-f 5 -> five forks each to expose run to run variance</li>
</ul>
</div>
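<div>
To make concrete what "no backoff" means here, below is a rough sketch of the measured loops (my own simplification for illustration, <b>not</b> the actual JMH benchmark code linked above): both threads spin flat out, counting successes and failures, and a failed offer/poll is simply retried on the next loop iteration.</div>
<div>
<pre><span style="font-family: "courier new" , "courier" , monospace;">import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;

class NoBackoffThroughputSketch {
    static final Integer ELEMENT = 1;

    // producer thread: these counters correspond to offersMade/offersFailed below
    static void producerLoop(Queue<Integer> q, AtomicBoolean stop, long[] c) {
        while (!stop.get()) {
            if (q.offer(ELEMENT)) c[0]++;   // offersMade
            else                  c[1]++;   // offersFailed, retry immediately
        }
    }

    // consumer thread: these counters correspond to pollsMade/pollsFailed below
    static void consumerLoop(Queue<Integer> q, AtomicBoolean stop, long[] c) {
        while (!stop.get()) {
            if (q.poll() != null) c[2]++;   // pollsMade
            else                  c[3]++;   // pollsFailed, retry immediately
        }
    }
}</span></pre>
</div>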
<div>
<br /></div>
<div>
<h3>
Throughput benchmark: baseline(JMH params: -bm thrpt -tu us)</h3>
Here are some baseline results; note the unit is ops/us, equal to millions of ops per second:<br />
<span style="font-family: "courier new" , "courier" , monospace;">SpscArrayQueue (128k capacity)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">offersFailed 0.005 ± 0.008 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">offersMade 252.201 ± 1.649 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsFailed 0.009 ± 0.008 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsMade 252.129 ± 1.646 ops/us</span></div>
<div>
<br /></div>
<div>
So the SpscArrayQueue is offering great throughput, and seems pretty well balanced with failed offers/polls sort of cancelling out and low compared to the overall throughput.</div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">SpscLinkedQueue</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">offersFailed ≈ 0 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">offersMade 14.711 ± 5.897 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsFailed 12.624 ± 8.281 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">pollsMade 14.710 ± 5.896 ops/us</span></div>
</div>
<div>
<br /></div>
<div>
For the SpscLinkedQueue we have no failed offers, since it's an unbounded queue. We do see a fair amount of failed polls. We expect the polls to be faster than the offers as offering pays for allocation of nodes on each element (24b overhead per element), while the poll simply leaves it to the GC to toss it all away.</div>
<div>
With this baseline we would expect the linked array queues' performance to be somewhere between the 2 data points above. Unlikely to hit the highs of the preallocated array queue, but hopefully much better than a linked queue.<br />
<br /></div>
<div>
<h3>
Throughput benchmark: growable</h3>
<div>
So assuming we let it grow to 128k, how does the SpscGrowableArrayQueue perform in this benchmark and how much does the initial size impact the performance? CNK here is the <b>initial</b> buffer size. The buffer doubles in size whenever offer fills up the current buffer, until we hit the maximum buffer size.</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> CNK Score Error Units</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16 offersFailed 0.006 ± 0.006 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16 offersMade 183.720 ± 0.450 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16 pollsFailed 0.003 ± 0.001 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16 pollsMade 183.592 ± 0.450 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 128 offersFailed 0.003 ± 0.006 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 128 offersMade 184.236 ± 0.336 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 128 pollsFailed 0.003 ± 0.001 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 128 pollsMade 184.107 ± 0.336 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1K offersFailed 0.001 ± 0.003 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1K offersMade 183.113 ± 1.385 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1K pollsFailed 0.003 ± 0.001 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1K pollsMade 182.985 ± 1.385 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16K offersFailed 0.007 ± 0.006 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16K offersMade 181.388 ± 5.380 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16K pollsFailed 0.004 ± 0.001 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16K pollsMade 181.259 ± 5.380 ops/us</span></div>
<br /></div>
<div>
<ul>
<li>Under constant streaming pressure the Growable queue will keep growing until either the full-sized buffer is allocated (very likely) or a smaller buffer in which the throughput is sustainable is found (unlikely for this benchmark, as all it takes is a single spike). If that were the case we would have no failing offers. Either way we expect the transition to the last buffer to be a short phase, after which the algorithm is very similar to SpscArrayQueue and no further allocations happen. The number of resizing events is small, as the buffer doubles each time (so log2(capacity/initial size), e.g. for initial capacity 16k: 16k -> 32k -> 64k -> 128k).</li>
<li>You may consider the slowdown from SpscArrayQueue large at roughly 25%, but I don't think it's too bad considering that with the throughputs in question we are looking at costs in the single-digit nanoseconds, where every extra instruction is going to show up (back of envelope: 250 ops/us -> ~4ns per offer/poll vs 180 ops/us -> ~5ns; 1ns = ~3 cycles ~= 12 instructions or 1 L1 load).</li>
</ul>
</div>
<div>
<h3>
Throughput benchmark: chunked</h3>
<div>
<div>
For Chunked we see the expected increase in throughput as we increase the chunk size (CNK is the <b>fixed chunk size</b>, the max size is 128K):</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> CNK Score Error Units</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16 offersFailed ≈ 0 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16 offersMade 43.665 ± 0.892 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16 pollsFailed 9.160 ± 0.519 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16 pollsMade 43.665 ± 0.892 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 128 offersFailed ≈ 10⁻⁴ ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 128 offersMade 151.473 ± 18.786 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 128 pollsFailed 0.380 ± 0.331 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 128 pollsMade 151.443 ± 18.778 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1K offersFailed 0.309 ± 0.375 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1K offersMade 149.351 ± 14.102 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1K pollsFailed 0.112 ± 0.125 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1K pollsMade 149.314 ± 14.120 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16K offersFailed ≈ 10⁻⁸ ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16K offersMade 175.408 ± 1.563 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16K pollsFailed 0.038 ± 0.031 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16K pollsMade 175.394 ± 1.563 ops/us</span></div>
<div>
<br />
<ul>
<li>Note that the decline in throughput for smaller chunks is matched by an increase in poll failures, indicating that the consumer becomes faster than the producer as the chunk grows smaller, requiring more frequent allocations by the producer.</li>
<li>Note also that even with 16 slot chunks this option is ~3 times faster than the linked alternative.</li>
<li>Under constant streaming pressure the Chunked queue will be pushed to its maximum size, which means the producer will be constantly allocating buffers. The producer resize conditions are also slightly trickier and require sampling of the consumer index. The consumer will be slowed down by this sampling, and also slowed down by jumping to new buffers. This problem will be worse as more resizing happens, which is a factor of chunk size.</li>
<li>The benefit of larger chunks will cap out at some point, you could explore this parameter to find the optimum.</li>
<li>An exercise to readers: run the benchmark with the JMH GC profiler and compare the queues. Use it to verify the assumption that Growable produces a bounded amount of garbage, while Chunked continues to churn.</li>
<li>Max throughput is slightly behind Growable.</li>
</ul>
</div>
</div>
The main takeaways for sizing here seem to me that tiny chunks are bad, but even with small/medium chunks you can have pretty decent throughput. The right chunk size should therefore depend on your expectations of average traffic on the one hand and the desirable footprint when empty on the other.<br />
<br /></div>
<div>
<h3>
Throughput benchmark: unbounded</h3>
</div>
<div>
For unbounded we see the expected increase in throughput as we increase the chunk size (CNK is the chunk size, the max size is infinity and beyond):</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> CNK Score Error Units</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16 offersFailed ≈ 0 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16 offersMade 56.315 ± 7.563 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16 pollsFailed 10.823 ± 1.611 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16 pollsMade 56.315 ± 7.563 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 128 offersFailed ≈ 0 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 128 offersMade 135.119 ± 23.306 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 128 pollsFailed 1.236 ± 0.851 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 128 pollsMade 131.770 ± 21.535 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1K offersFailed ≈ 0 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1K offersMade 182.922 ± 3.397 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1K pollsFailed 0.005 ± 0.003 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 1K pollsMade 176.208 ± 3.221 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16K offersFailed ≈ 0 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16K offersMade 177.586 ± 2.929 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16K pollsFailed 0.031 ± 0.038 ops/us</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 16K pollsMade 176.884 ± 2.255 ops/us</span></div>
<div>
<br />
<ul>
<li>The 16 element chunk is ~4 times faster than the linked list option, and as the chunk size increases it gets more efficient.</li>
<li>Max throughput is slightly behind growable.</li>
<li>Why is Chunked faster than Unbounded on 128 chunks, but slower on 1K? I've not looked into it, it's taken long enough to write this bloody post as it is. How about you check it out and let me know?</li>
</ul>
<div>
</div>
<br />
<h3>
Throughput benchmark: summary</h3>
<ul>
<li>Growable queue performs well regardless of initial size for this case.</li>
<li>For chunked and unbounded the chunk size has definite implications on throughput. Having said that, throughput is very good even for relatively small chunks.</li>
<li>Note that the results for the same benchmark without a GC cycle between iterations were very noisy. The above result intentionally removes the variance GC induces by forcing GC and allowing a large heap. The GC impact of linked array queues when churning will likely be in increasing old generation pressure as the overflow chunks are likely to have been promoted before they get collected. This is assuming a load where overflow is not that frequent and other allocation is present.</li>
</ul>
<br />
<h3>
Interlude</h3>
</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhT_tF54ZDNdVsOPEPfuXTN5GJKgKbBTZPzJRDQWa-tMygssLx9A-V9aQwCveH9UMA5AQMQftW8uhpcIh2JAfQPd6fVBvu8W1iZUd65C4BbgkdydcZQFkXne4Sxpho2Se1iwal2wC3B7IQ/s1600/lab+fire2.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="298" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhT_tF54ZDNdVsOPEPfuXTN5GJKgKbBTZPzJRDQWa-tMygssLx9A-V9aQwCveH9UMA5AQMQftW8uhpcIh2JAfQPd6fVBvu8W1iZUd65C4BbgkdydcZQFkXne4Sxpho2Se1iwal2wC3B7IQ/s320/lab+fire2.jpg" width="320" /></a></div>
Go ahead, grab a beer, or a coffee, a mojito perhaps (Norman/Viktor, go on), or maybe order a large <a href="https://en.wikibooks.org/wiki/Bartending/Cocktails/Pan_Galactic_Gargle_Blaster" target="_blank">Pan Galactic Gargle Blaster</a>, you've earned it. I never thought you'd read this far, it's a tad dry innit? Well, it's not fun writing it either, but we're getting there, just need to look at one more benchmark...<br />
<br />
<h3>
Burst "cost"/latency benchmark: background and method</h3>
</div>
<div>
The burst cost benchmark is a more stable workload than the throughput one. The producer sends a burst of messages to a consumer. The consumer signals completion when the last message in the burst has arrived. The measurement, taken on the producer thread, runs from the first message sent to the observed arrival of the last message. It's a 'latency' benchmark, or rather an estimate of average communication cost via the particular thread. It's got bells on. It's a friend, and it's a companion, it's the only product you will ever need, follow these easy assembly instructions it never needs ironing.</div>
<div>
This is, I think, a better evaluation of queue characteristics than the throughput benchmark for most applications. The queue starts empty, is hit with a burst of traffic and the burst is drained. The cost measured is inclusive of the return signal latency, but as scenarios go this is not too far-fetched. Calling this queue latency is a damn sight better than PRETENDING THE BLOODY INVERSE OF THROUGHPUT IS LATENCY. <deep breath></div>
<div>
Same machine and JMH parameters used as above. All the measurements below are average time per operation in nanoseconds. The <a href="https://github.com/JCTools/JCTools/blob/master/jctools-benchmarks/src/main/java/org/jctools/jmh/latency/QueueBurstCost.java" target="_blank">benchmark code can be found here</a>.</div>
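<div>
For the visually inclined, here's a rough sketch of the shape of this measurement (my own simplification, <b>not</b> the actual benchmark code linked above): the producer offers a burst, a separate consumer thread drains the queue and publishes a running count, and the producer measures from the first offer until it observes the whole burst was consumed.</div>
<div>
<pre><span style="font-family: "courier new" , "courier" , monospace;">import java.util.Queue;
import java.util.concurrent.atomic.AtomicLong;

class BurstCostSketch {
    final AtomicLong received = new AtomicLong();

    void consumerLoop(Queue<Integer> q) {                  // runs on its own thread
        for (;;) {
            if (q.poll() != null) {
                received.lazySet(received.get() + 1);      // single writer, ordered store is enough
            }
        }
    }

    long burstCostNs(Queue<Integer> q, int burstSize) {    // called from the producer thread
        long target = received.get() + burstSize;
        long start = System.nanoTime();
        for (int i = 0; i < burstSize; i++) {
            while (!q.offer(i)) ;                          // spin until the element is accepted
        }
        while (received.get() < target) ;                  // spin until the burst is fully drained
        return System.nanoTime() - start;                  // includes the return signal latency
    }
}</span></pre>
</div>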
<div>
<br /></div>
<h3>
Burst Cost benchmark: baseline</h3>
<div>
Testing first with SpscArrayQueue and SpscLinkedQueue to establish the expected baseline behaviour, BRST is the size of the burst:</div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">SpscArrayQueue (128k capacity)</span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">BRST Score Error Units</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 1 284.709 ± 8.813 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 10 368.028 ± 6.949 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">100 914.150 ± 11.424 ns/op</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: inherit;">Right, sending one message has the overhead of cache coherency making data visible to another core. Sending 10/100 messages we can see the benefits of the SpscArrayQueue in allowing consumer and producer to minimize cache coherency overhead per element. We see a satisfying drop in cost per element as the burst size grows (the per element cost is the cost of the burst divided by the number of elements sent, so we see here: 1 -> 284, 10 -> 36, 100 -> 9), but this DOES NOT MEAN THE FRIGGIN' LATENCY IS BLOOMIN' DOWN TO 9ns WHEN WE SEND 100 MESSAGES.</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">SpscLinkedQueue</span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">BRST Score Error Units</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 1 378.043 ± 7.536 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 10 1675.589 ± 44.496 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">100 17036.528 ± 492.875 ns/op</span></div>
</div>
</div>
<div>
<br /></div>
<div>
For the linked queue the per element overheads are larger, as is the cost of chasing pointers through a linked list rather than scanning through an array as we poll data out. The gap between it and the SpscArrayQueue widens as the burst size grows. In other words, the linked queue fails to make the most of the batching opportunity offered by slack in the queue.</div>
<div>
<br /></div>
<div>
<h3>
Burst Cost benchmark: growable</h3>
</div>
<div>
We expect the growable queue to grow to accommodate the size of the burst. The eventual buffer size will be a tighter fit around the burst size, which in theory might be a benefit as the array is more likely to fit in cache. Let's spin the wheel (CNK is the <b>initial </b>chunk size, the max size is 128K):</div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">BRST CNK Score Error Units</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 1 16 327.703 ± 11.485 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 1 128 292.382 ± 9.807 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 1 1K 275.573 ± 6.230 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 1 16K 286.354 ± 6.980 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 10 16 599.540 ± 73.376 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 10 128 386.828 ± 10.016 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 10 1K 376.295 ± 8.009 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 10 16K 358.096 ± 6.107 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">100 16 1173.644 ± 28.669 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">100 128 1152.241 ± 40.067 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">100 1K 966.612 ± 9.504 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">100 16K 951.495 ± 12.425 ns/op</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
We have to understand the implementation to understand the results here, in particular:</div>
<div>
<ul>
<li>The growable queue buffer will grow to accommodate the burst in a power of 2 sized array. In particular, a burst of 100 forces the queue with an initial chunk of 16 to grow to a 128 element buffer, the same size the 128 chunk queue starts with, so the delta between the two configurations becomes marginal once that happens.</li>
<li>The queue tries to probe ahead within a buffer to avoid checking for available space on each element. The read ahead step is 25% of the buffer size. The smaller the buffer the more often we need to probe ahead (e.g. for a 16 element buffer we do this every 4 elements). This overhead is visible in the smaller buffers (see the sketch after this list).</li>
<li>A burst which fills more than 75% of the buffer will fail the long probe described above and fall back to probing a single element ahead. This implies that buffers which fit the burst size too snugly will have worse performance.</li>
<li>When the buffers are sufficiently large the costs closely match the costs observed for the SpscArrayQueue. Yay!</li>
</ul>
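<div>
To illustrate the probe ahead idea referenced above, here is a minimal sketch of the producer side (the helper names and exact structure are assumed for illustration, they are not the actual JCTools code): rather than checking the target slot on every offer, the producer checks a slot lookAheadStep positions ahead; if that slot is free then, with a single consumer, all slots before it are free too and can be written without further checks.</div>
<div>
<pre>
// Producer side sketch, single consumer assumed; helper names are made up.
if (null == lvElement(buffer, offsetOf(pIndex + lookAheadStep))) {
    // the slot lookAheadStep ahead is free, so all slots before it are free too:
    // raise the limit and skip per element checks until we reach it
    producerLimit = pIndex + lookAheadStep;
    soElement(buffer, offsetOf(pIndex), e);
} else if (null == lvElement(buffer, offsetOf(pIndex + 1))) {
    // long probe failed: check a single slot ahead (we must keep one slot free
    // for the JUMP marker) and write without raising the limit
    soElement(buffer, offsetOf(pIndex), e);
} else {
    // looks full: link/grow a new buffer (unbounded/chunked/growable specifics differ)
}
</pre>
</div>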
<br />
<div>
<h3>
Burst Cost benchmark: chunked</h3>
</div>
</div>
<div>
<div>
For Chunked we see a slight increase in base cost and a bummer when the burst size exceeds the chunk size (CNK is the chunk size, the max size is 128K):</div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">BRST CNK Score Error Units</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 1 16 311.743 ± 11.613 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 1 128 295.987 ± 5.468 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 1 1K 281.308 ± 8.381 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 1 16K 281.962 ± 7.376 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 10 16 478.687 ± 52.547 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 10 128 390.041 ± 16.029 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 10 1K 371.067 ± 7.789 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 10 16K 386.683 ± 5.276 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">100 16 2513.226 ± 38.285 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">100 128 1117.990 ± 14.252 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">100 1K 969.435 ± 10.072 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">100 16K 939.010 ± 8.173 ns/op</span></div>
</div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: inherit;">Results are overall similar to the growable, what stands out is:</span></div>
<div>
<ul>
<li>If the chunk is too small to accommodate the burst we see a large increase in cost. Still, compared to the SpscLinkedQueue this is a significant improvement. Compared to the growable version, we can see the sense in perhaps letting the queue grow to a better size in response to bursts.</li>
<li>If the chunk is large enough to accommodate the burst, behaviour closely matches the SpscGrowableArrayQueue. Yay!</li>
</ul>
<br />
<div>
<h3>
Burst Cost benchmark: unbounded</h3>
</div>
<div>
Final one, just hang in there. </div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">BRST CNK Score Error Units</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 1 16 303.030 ± 11.812 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 1 128 308.158 ± 11.064 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 1 1K 286.379 ± 6.027 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 1 16K 282.574 ± 10.886 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 10 16 554.285 ± 54.468 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 10 128 407.350 ± 11.227 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 10 1K 379.716 ± 9.357 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 10 16K 370.885 ± 12.068 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">100 16 2748.900 ± 64.321 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">100 128 1150.393 ± 26.355 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">100 1K 1005.036 ± 14.491 ns/op</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">100 16K 979.372 ± 13.369 ns/op</span></div>
</div>
<div>
</div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;">What stands out is:</span></div>
<div>
<ul>
<li>If the chunk is too small to accommodate the burst we see a large increase in cost. Still, compared to the SpscLinkedQueue this is a significant improvement.</li>
<li>If the chunk is large enough to accommodate the burst and make the most of probing ahead, the costs closely resemble the SpscArrayQueue for larger bursts. Yay!</li>
</ul>
<h3>
Burst Cost benchmark: summary</h3>
<div>
We see a pretty much expected result for these queues: on the fast path they are the same, so when the fast path dominates they show the same costs as a plain SpscArrayQueue, which is good news. When chunks are too small and we have to allocate new chunks we start to see overheads.</div>
<div>
A more subtle observation here is that smaller buffers have some drawbacks, as the slow path of the producer code is more likely to be executed. This correctly reflects the empty queue assumption that the JCTools queues rely on, but broken assumptions are... well... broken, so the cost goes up.</div>
<div>
A further consideration here for smaller buffers is the hot/cold structure of the code. It is intended that the producer code inlines the "offer" hot path, but as the cold path is rarely run it will not be inlined. This is an intentional inlining failure. Inlining the cold path would make "offer" larger and a lot more complex, making the compiler's job harder and possibly resulting in worse code. When we run with burst/buffer sizes which systematically violate the hot/cold assumption we can trigger a bad inlining decision. This can be worked around by marking the cold methods as "dontinline" using the CompileCommand option or the compiler oracle file.</div>
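<div>
For example, something along these lines (the class and method names here are illustrative; adjust them to the actual cold path methods in the JCTools version you are running):</div>
<div>
<pre>
# On the command line:
java -XX:CompileCommand=dontinline,org.jctools.queues.BaseSpscLinkedArrayQueue::offerColdPath ...

# Or via a compiler oracle file (-XX:CompileCommandFile=.hotspot_compiler), one directive per line:
dontinline org/jctools/queues/BaseSpscLinkedArrayQueue offerColdPath
</pre>
</div>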
<h3>
Mmmm... this is boring :(</h3>
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgg9o6kDh-Ut_bXIs4CQ4VR_0EIWmT8RrifUBo4IMNrVkVsuwl6gmPPXmKurbvLwqZhqcCWb1J3gohqp8HehGiNq-ukK_krp8O9c_Nw6walvXWuULOQTvj7u-GA-LYIhozxrDjXRWT3XfY/s1600/Isaac_Newton_Labratory_Fire-300x207.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgg9o6kDh-Ut_bXIs4CQ4VR_0EIWmT8RrifUBo4IMNrVkVsuwl6gmPPXmKurbvLwqZhqcCWb1J3gohqp8HehGiNq-ukK_krp8O9c_Nw6walvXWuULOQTvj7u-GA-LYIhozxrDjXRWT3XfY/s1600/Isaac_Newton_Labratory_Fire-300x207.jpg" /></a></div>
<div>
Yes... Nothing too surprising happened here, I did not emerge from the lab with my coat on fire, these things happen. One anecdote worth sharing is that I originally ran the benchmarks with only 2 threads allocated to the JVM; this resulted in noisier measurements as I effectively under provisioned the JVM with CPUs for compilation/GC and for any OS scheduling contention/interrupts. When running on a 2 core laptop this is a reasonable compromise to fix the cross core topology of the benchmark, but on a server class machine it is easy enough to provision the same topology with more CPUs.</div>
<div>
Next part will feature the astounding extension of these queues to the MPSC domain and will be far more interesting! I promise.</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<ul></ul>
</div>
</div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com0tag:blogger.com,1999:blog-5171098727364395242.post-1439519077705911162016-10-24T07:57:00.000+01:002016-10-24T07:57:24.037+01:00Linked Array Queues, part 1: SPSC<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9yp7IHSFZ76szZbFkkiGBy6slWcUO_pEj8AAx5vfC2GE5FrSrb2eoTZfa-VTrXF1_jh4ulplILKXW6g-V7mWf5z_zHzrF_dlagnpDELTM7wTVCK72KhzvmjLoQXBANZA54w8DzCiVaKU/s1600/chunky+monkey.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9yp7IHSFZ76szZbFkkiGBy6slWcUO_pEj8AAx5vfC2GE5FrSrb2eoTZfa-VTrXF1_jh4ulplILKXW6g-V7mWf5z_zHzrF_dlagnpDELTM7wTVCK72KhzvmjLoQXBANZA54w8DzCiVaKU/s1600/chunky+monkey.jpg" /></a></div>
When considering concurrent queues people often go for either:<br />
<div>
<ol>
<li>An array backed queue (circular array/ring buffer etc.)</li>
<li>A linked list queue</li>
</ol>
</div>
<div>
The trade off in the Java world seems to be that array backed queues offer better throughput, but are always bounded and allocated upfront, and linked queues offer smaller footprint when empty, but worse throughput and higher overhead per element when full. Linked queues are also often unbounded.</div>
<div>
In <a href="https://github.com/JCTools/JCTools" target="_blank">JCTools</a> we have the above duality, but in later versions introduced a hybrid queue which attempts to offer a middle ground between the 2 extremes above. This hybrid is the linked array queue:</div>
<div>
<ol>
<li>Queue is backed by one or more arrays of references.</li>
<li>As long as the queue can fit in a single array no further arrays are allocated.</li>
<li>When empty the queue can naturally shrink to a single backing array.</li>
</ol>
For the SPSC case this has already been done in C++ Fast Flow with their uSPSC queues. In Java there are no other implementations that I know of (please let me know if I missed anything).</div>
<div>
In implementing these queues I have relied heavily on the feedback and support of <a href="https://github.com/akarnokd" target="_blank">@akarnokd</a>, <a href="https://github.com/pcholakov" target="_blank">@pcholakov</a> and others who contributed fixes, filed bugs, and so forth. Thanks guys!</div>
<div>
3 variations on linked array queues have been explored in JCTools (a construction sketch follows the list):</div>
<div>
<ol>
<li>Chunked: Each backing array is a fixed chunk size. The queue is bounded.</li>
<li>Unbounded: Each backing array is a fixed chunk size. The queue is unbounded.</li>
<li>Growable: Chunk size doubles every time until the full blown backing array is used. The queue is bounded.</li>
</ol>
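<div>
As a rough illustration of how these variants are instantiated (the constructor signatures are from memory and may differ between JCTools versions, so treat this as a sketch):</div>
<div>
<pre>
// import org.jctools.queues.*; import java.util.Queue;
// Bounded: fixed 128 element chunks, 128K max capacity
Queue&lt;Object&gt; chunked   = new SpscChunkedArrayQueue&lt;&gt;(128, 128 * 1024);
// Unbounded: fixed 128 element chunks
Queue&lt;Object&gt; unbounded = new SpscUnboundedArrayQueue&lt;&gt;(128);
// Bounded: starts at 128 elements, doubles up to the 128K max capacity
Queue&lt;Object&gt; growable  = new SpscGrowableArrayQueue&lt;&gt;(128, 128 * 1024);
</pre>
</div>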
<div>
<br /></div>
<h3>
Hot Path offer/poll</h3>
The queues all share the same polling logic and on the fast path share the same offer logic as well:</div>
<div>
<script src="https://gist.github.com/nitsanw/45be60f4e051e4c2d3fecfe36a765726.js"></script></div>
<div>
If you've not read JCTools code before, or maybe you've forgotten, here's the convention:</div>
<div>
<ul>
<li>sv/lvXXX: Store/Load Volatile, same as a volatile field write/read</li>
<li>sp/lpXXX: Store/Load Plain, same as a normal field write/read</li>
<li>soXXX: Store Ordered, same as an AtomicXXX.lazySet</li>
</ul>
This code will need reconsidering in a post-JDK9 world, some other time perhaps.</div>
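<div>
For readers who like to see it spelled out, here is a sketch of what the conventions roughly correspond to, using java.util.concurrent.atomic classes as stand-ins (the real code uses Unsafe for field and array access):</div>
<div>
<pre>
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

class ConventionSketch&lt;E&gt; {
    final AtomicLong producerIndex = new AtomicLong();
    final AtomicReferenceArray&lt;E&gt; buffer = new AtomicReferenceArray&lt;&gt;(128);

    long lvProducerIndex()       { return producerIndex.get(); }   // Load Volatile == volatile read
    void soProducerIndex(long v) { producerIndex.lazySet(v); }     // Store Ordered == lazySet
    E    lvElement(int i)        { return buffer.get(i); }         // Load Volatile element read
    void soElement(int i, E e)   { buffer.lazySet(i, e); }         // Store Ordered element write
    // sp/lp (Store/Load Plain) are ordinary field/array accesses with no ordering guarantees
}
</pre>
</div>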
<div>
So what's the deal:</div>
<div>
<ul>
<li>As long as we are not past the producerLimit, keep writing.</li>
<ul>
<li>If we have passed it, go to the slow path (where the money be)</li>
</ul>
<li>As long as there's stuff in the queue, read it.</li>
<ul>
<li>Unless it's JUMP, in which case read through to next array.</li>
</ul>
</ul>
The general idea here being that the common case is small and simple. This should have the following effects:</div>
<div>
<ol>
<li>offer/poll code is small enough to inline when hot.</li>
<li>offerColdPath/newBufferPoll are set up to either not inline or, when inlined, be out of band code blocks. This should keep the hot path small and help the compiler focus on more important things.</li>
<li>offer/poll should perform similar to the SpscArrayQueue in the common case.</li>
</ol>
<span style="background-color: #fce5cd;">NOTE: The producer publishes element then index, but the consumer does the reverse. This ensures a consistent view on the consumer side where the following assertion must hold:</span><br />
<span style="background-color: #fce5cd;"> !queue.isEmpty() => queue.poll() != null</span></div>
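<div>
To make the ordering concrete, here is a sketch of the producer side using the convention names introduced earlier (this is the idea, not the exact JCTools code):</div>
<div>
<pre>
// Producer side sketch: publish the element before the index that announces it
void offerSketch(E e, int offset, long pIndex) {
    soElement(offset, e);          // ordered store of the element first
    soProducerIndex(pIndex + 1);   // ordered store of the index second
}
// isEmpty() compares lvConsumerIndex() with lvProducerIndex(); if the consumer
// observes the new producer index, the element store above is already visible,
// hence: !queue.isEmpty() => queue.poll() != null
</pre>
</div>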
<div>
<span style="background-color: #fce5cd;"><br /></span>
<span style="background-color: #fce5cd;">NOTE: In some early versions the new array itself was stored in the slot instead of the JUMP constant marker. This required an instanceof check for each element loaded and a compromise: either disallow Object[] as an element of the queue or introduce some wrapper class. Comparing to a constant turned out to be much simpler and faster.</span><br />
<ol>
</ol>
<h3>
Cold Path poll</h3>
</div>
<div>
The consumer has hit a JUMP, which indicates the producer has linked a new array to this one. The new array is the last element of the current array. We go to newBufferPoll:</div>
<div>
<script src="https://gist.github.com/nitsanw/d59092ebbace552767466ba1fb1c97a5.js"></script></div>
<div>
<br /></div>
<div>
<span style="background-color: white;">The consumer therefore has to:</span></div>
<div>
<ol>
<li>Load new array from the last element of old array.</li>
<li>Null out reference from old to new (prevent nepotism).</li>
<li>Adjust consumer view on buffer and mask.</li>
<li>Poll (remove element and progress counter) from new buffer.</li>
</ol>
Note that peek similarly refreshes its view of the buffer on hitting the JUMP marker. This goes against the standard spirit of peek, which is a view only operation.<br />
<br />
<h3>
Cold Path offer: Unbounded</h3>
</div>
<div>
This method is implemented differently for each of the use cases, unbounded is the simplest:</div>
<div>
<script src="https://gist.github.com/nitsanw/a0130d5113fb04064d2d41ffc9bec71e.js"></script></div>
<div>
In the unbounded case we care only about our ability to make progress inside the current buffer:<br />
<br />
<ol>
<li>Probe inside the buffer to see if we have 'many' elements to go before full. If the buffer is mostly empty (this is the case for most applications most of the time), then we have successfully saved ourselves loading elements from the queue before writing in. A successful probe will update the producer limit and write to the queue.</li>
<li>Probe failed, we check if we can write a single element. Remember we always need one free slot to JUMP with, so we look at the slot following the one we are on. If the next slot is empty we write to the current one, but we don't update the limit.</li>
<li>A single slot remains in the buffer. We allocate a new buffer and link to it.</li>
</ol>
This is not a large departure from the SpscArrayQueue, I leave the comparison as an exercise to the delightful reader.<br />
<br />
<h3>
Cold Path offer: Chunked</h3>
</div>
<div>
With chunked linked array queues we have a fixed chunk size, but an overall bound on the size. This complicates matters because on top of the buffer level limits put on the producer we must also consider the queue level limitation. In particular there might be space available in the current producer buffer, while the queue is in fact full. Here's the implementation:</div>
<div>
<script src="https://gist.github.com/nitsanw/840c50bfe0d01408a49b5a100cb5c7a7.js"></script></div>
<div>
Similar to the above but the difference lies in negotiating the buffer vs. queue limit.</div>
<div>
<br />
<h3>
Cold Path offer: Growable</h3>
</div>
<div>
The growable queue is similar in spirit to an ArrayList as it doubles its backing array capacity when a buffer is full. This adds an interesting bit of information to the game, since:</div>
<div>
<ol>
<li>If we are not on the last buffer, we've not hit the queue limit,</li>
<li>If we're on the last buffer, and so is the consumer, we can revert to just checking for null in the buffer.</li>
<li>If we're on the last buffer, and the consumer isn't, we need to hang tight and let it pass. It's a temporary state.</li>
</ol>
The code for handling this is rather ugly and long, but since you've put up with the rest:</div>
<div>
<script src="https://gist.github.com/nitsanw/ca0b4dbed6a41947ad1c14e4f76f1320.js"></script></div>
<div>
The lookAheadStep is dynamically adjusted as the buffer grows, and also acts as an indicator for the transition period during which the producer is on the last buffer and the consumer is trailing. It's a mess to look at, but sadly performs better than a neater alternative which builds on the Chunked variant... General idea:<br />
<br />
<ol>
<li>lookAheadStep is positive => we are either not on last buffer, or on it for both consumer and producer => it is enough to consider the elements in the producer buffer to determine if the queue is full. In particular if the buffer is full then we must resize unless we are on the last buffer in which case we are full. Note that we allow using the last buffer to the last element, since we don't need a JUMP anymore.</li>
<li>lookAheadStep is negative => we are waiting for consumer to get to the last buffer. We use the lookAheadStep to indicate how far behind the consumer is.</li>
</ol>
</div>
<div>
It's not complex, just messy, and if you've got an elegant representation please ping me with your suggestions.<br />
<h3>
Performance?</h3>
</div>
<div>
GODDAMN it! this is the time consuming part! I've benchmarked on a few setups, but not kept track or clear records. I'll need to do it all again, might as well be another post since nobody reads this far into these things anyhow. La la la la la, performance is great, la la la la la, if I look now I might be disappointed, la la la la la, this is better than benchmarking, la la la la la.</div>
<div>
<br /></div>
<h2>
TO BE CONTINUED...</h2>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com1tag:blogger.com,1999:blog-5171098727364395242.post-75051928713883304252016-10-21T13:26:00.001+01:002016-10-21T13:26:19.981+01:004 Years Blog Anniversary<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLkmeqqjXgXv72uxCv-Zw2h65Xoka9swNmqXibBQoWVI_otWViNhEtKWU5zJX2TNWK2gHk9i0vpmlqvz2qUek3c6jsWaeH26w8U-xUhAcCirUTj8zZDgk7B1O7DHeCTotvsaWn3uEwLjM/s1600/jabberwocky.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLkmeqqjXgXv72uxCv-Zw2h65Xoka9swNmqXibBQoWVI_otWViNhEtKWU5zJX2TNWK2gHk9i0vpmlqvz2qUek3c6jsWaeH26w8U-xUhAcCirUTj8zZDgk7B1O7DHeCTotvsaWn3uEwLjM/s320/jabberwocky.jpg" width="215" /></a></div>
This has been a very slow year blogging wise... too much work, travel, and family. 4 years ago I aimed for 2 posts a month. I nearly hit it in 2013, failed by a little in 2014, and went on to not meet it at all in the last 2 years...<br />
Coincidentally I also have a delightfully busy 2.5 year old running around; I suspect she has something to do with how little time I have. My eldest has developed a terrible addiction to reading so I hardly see her these days, so she can't be blamed for this one.<br />
Since I started the blog I've also embarked on developing and maintaining <a href="http://www.jctools.org/" target="_blank">JCTools</a>, <a href="http://psy-lob-saw.blogspot.co.za/p/talks.html" target="_blank">speaking at conferences</a>, and starting a local<a href="http://www.meetup.com/Cape-Town-Java-Meetup/" target="_blank"> JUG in Cape Town</a>. Sharing of one kind breeds another, but you don't seem to gain bonus hours in the day. Hmmm.<br />
If I sound like I'm making excuses it's because I'm still feeling bad about missing that 2 posts a month goal.<br />
Despite my decreased output, blog traffic has steadily increased teaching me a thing or 2 about search and web economics. I hope to write more in the next year as I plan to give conferences a rest for the next 6-8 months, will see how it goes.<br />
In any case,<b> thanks for reading :-)</b><br />
<br />Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com2tag:blogger.com,1999:blog-5171098727364395242.post-48305169621596752542016-07-18T09:20:00.000+01:002016-07-20T09:42:22.189+01:00Fixing Coordinated Omission in Cassandra Stress<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0WpwEICMKGeppXb0OX1GpBoOoqtDvUqO9R_tfoekzE3sLM5-itHCVTcQQEX97ciLuk8Cc0eEJlKE9ZZi7HYfrPpo3qe_pDMe2dOe69RhKLVPb1GJUJWBi9m7EiWqJDVGSrnLWQgwenfE/s1600/cassandra-logo.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0WpwEICMKGeppXb0OX1GpBoOoqtDvUqO9R_tfoekzE3sLM5-itHCVTcQQEX97ciLuk8Cc0eEJlKE9ZZi7HYfrPpo3qe_pDMe2dOe69RhKLVPb1GJUJWBi9m7EiWqJDVGSrnLWQgwenfE/s1600/cassandra-logo.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="background-color: #f9f9f9; color: #252525; font-family: sans-serif; line-height: 22.4px;"><span style="font-size: xx-small;">Copyright © 2016 Apache Software Foundation</span></span></td></tr>
</tbody></table>
I have it from reliable sources that incorrectly measuring latency can lead to losing one's job, loved ones, will to live and control of bowel movements. Out of my great love for fellow humans I have <a href="http://psy-lob-saw.blogspot.com/2015/03/fixing-ycsb-coordinated-omission.html" target="_blank">fixed the YCSB load generator</a> to avoid the grave danger that is Coordinated Omission; this was met with great joy, and I have now gone on to fix up <a href="http://cassandra.apache.org/" target="_blank">Cassandra</a> Stress in a similar vein. The <a href="https://github.com/apache/cassandra/commit/89f275c6587150907b90ed03ca515b12c0c6ae11" target="_blank">fix is now merged</a> into <a href="https://github.com/apache/cassandra" target="_blank">trunk</a>, so expect to have it with the next Cassandra release (or just build it from source NOW).<br />
Before we start on the how and what, let us take a step back and consider the why:<br />
<ul>
<li>Coordinated Omission is a term coined by my esteemed colleague Gil Tene to describe a common measurement anti-pattern where the measuring system inadvertently coordinates with the system under measurement (watch Gil's <a href="https://www.youtube.com/watch?v=lJ8ydIuPFeU" target="_blank">How NOT to Measure Latency talk</a>). This results in greatly skewed measurements in commonly used load generators, and has led to long discussions on the <a href="https://groups.google.com/forum/#!msg/mechanical-sympathy/icNZJejUHfE/BfDekfBEs_sJ" target="_blank">Mechanical Sympathy mailing list</a> etc. I've given <a href="http://psy-lob-saw.blogspot.com/2015/03/fixing-ycsb-coordinated-omission.html" target="_blank">my own explanation in a previous post</a>, so go and have a read if you need a refresher on the subject.</li>
<li>Cassandra Stress (CS) is a tool which comes bundled with Cassandra to enable load testing of a given cluster. It allows the user to specify their own queries and schemas and is the preferred tool for Cassandra benchmarking as it gives better access to Cassandra features etc. CS allows 2 testing modes: throughput (default) or 'limit' where a given number of operations per second are thrown at the cluster (the fix discussed here does away with <b>limit</b> and replaces it with <b>throttle/fixed</b>, read on).</li>
</ul>
So coordinated omission is bad, and it is now fixed in Cassandra Stress. This post talks a bit on motivation (part I), a bit on implementation (part II) and a bit on what you can do with the new features (part III). Feel free to skip any and all parts, god knows those selfies don't take themselves.<br />
<br />
<h3>
PART I: Demonstrating The Issue</h3>
<div>
CS's default mode is to hit the system under test as hard as it can. This is a common strategy in load generators and can give system writers an interesting edge case to work with. I ran the benchmark on my laptop (no attempt at finding out real performance numbers here, I just care about the measurement issue); with the provided example workload I can saturate my Cassandra server (I only gave it a single core to run on) pretty easily. CS tells me the following about my miserable little server:</div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 Results:</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 Op rate : 5,055 op/s [insert: 500 op/s, simple1: 4,556 op/s]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 Partition rate : 17,294 pk/s [insert: 12,739 pk/s, simple1: 4,556 pk/s]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 Row rate : 150,266 row/s [insert: 12,739 row/s, simple1: 137,527 row/s]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 Latency mean : 4.5 ms [insert: 7.5 ms, simple1: 4.2 ms]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 Latency median : 3.0 ms [insert: 5.4 ms, simple1: 2.8 ms]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 Latency 95th percentile : 13.1 ms [insert: 20.1 ms, simple1: 11.9 ms]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 Latency 99th percentile : 23.8 ms [insert: 33.7 ms, simple1: 21.8 ms]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 Latency 99.9th percentile : 49.8 ms [insert: 55.0 ms, simple1: 49.0 ms]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 Latency max : 603.5 ms [insert: 598.7 ms, simple1: 603.5 ms]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 Total partitions : 1,000,000 [insert: 736,585, simple1: 263,415]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 Total errors : 0 [insert: 0, simple1: 0]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 Total GC count : 112</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 Total GC memory : 34.850 GiB</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 Total GC time : 1.0 seconds</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 Avg GC time : 9.4 ms</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 StdDev GC time : 16.6 ms</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 15:02:46 Total operation time : 00:00:57</span></div>
</div>
<div>
With the other 3 cores on my laptop hitting it as hard as they could the median latency on this maxed out server was 3ms. That is pretty awesome. But also, it makes no sense.</div>
<div>
How can a maxed out server have a typical response time of 3ms? In reality when servers are maxed out they are unresponsive; the typical response time becomes worse as the load increases. What CS is reporting however is not response time. It is 'latency'. Latency is one of those terms people use to mean many things and in this case in particular it does not mean "response time" but rather "service time". Here's a definition of more specific terms to describe system responsiveness (see wiki on <a href="https://en.wikipedia.org/wiki/Response_time_(technology)" target="_blank">response time</a>):</div>
<div>
<div>
<ul>
<li>Response time: The time between the submission of a request and the completion of the response.</li>
<li>Service time: The time between the initiation and completion of the response to a request.</li>
<li>Wait time: The time between the submission of the request and initiation of the response.</li>
</ul>
</div>
</div>
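<div>
These three terms relate through a simple identity that follows directly from the definitions above and is worth keeping in mind for the rest of this post: response time = wait time + service time.</div>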
<div>
In an all out test one could argue we want all the results as soon as possible, and given a magic load generator they would all launch instantaneously at the benchmark start. This will mean <b>submission time</b> for all requests is at the beginning of the run. Naturally the server will not be able to respond instantly to all requests and can only allow so many requests to be handled in parallel. If the max throughput is 5,000 ops/sec and we are measuring 100,000 ops, then after the first second 95K ops will have already waited a second, so their response time would be at least 1 second (response time > wait time). By the end of the run we would have 5K ops which have patiently waited at least 19 seconds (so the 99%ile should be at least 19 seconds).</div>
<div>
It follows that in an all out throughput benchmark response time is terrible by definition, and completely uninformative. It also follows that we should not expect the 'latency' above to be at all indicative of the sort of response time we would get from this server.</div>
<div>
The alternative to an all out throughput benchmark is a Responsiveness Under Load (RUL) benchmark. Using Cassandra Stress one can (or rather they could before this fix went in) use the '-rate <b>limit</b>=17000/s' option to benchmark under a load of 17k pks/sec (pks = partition keys, each operation costs X keys, the throughput limit is specified in pks not ops). Running this gives me a warm fuzzy feeling; now for sure I shall get a glimpse of the response time at max throughput:</div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 Results:</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 Op rate : 3,712 op/s [insert: 369 op/s, simple1: 3,343 op/s]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 Partition rate : 12,795 pk/s [insert: 9,452 pk/s, simple1: 3,343 pk/s]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 Row rate : 110,365 row/s [insert: 9,452 row/s, simple1: 100,913 row/s]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 Latency mean : 1.0 ms [insert: 1.6 ms, simple1: 0.9 ms]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 Latency median : 0.7 ms [insert: 1.3 ms, simple1: 0.7 ms]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 Latency 95th percentile : 2.2 ms [insert: 3.4 ms, simple1: 2.0 ms]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 Latency 99th percentile : 4.6 ms [insert: 7.4 ms, simple1: 4.1 ms]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 Latency 99.9th percentile : 13.4 ms [insert: 23.8 ms, simple1: 12.1 ms]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 Latency max : 63.9 ms [insert: 59.9 ms, simple1: 63.9 ms]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 Total partitions : 300,000 [insert: 221,621, simple1: 78,379]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 Total errors : 0 [insert: 0, simple1: 0]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 Total GC count : 33</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 Total GC memory : 10.270 GiB</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 Total GC time : 0.2 seconds</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 Avg GC time : 7.5 ms</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 StdDev GC time : 2.5 ms</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">INFO 08:03:54 Total operation time : 00:00:23</span></div>
</div>
<div>
<span style="font-family: inherit;">This seems nice, and if I were not a suspicious man I might accept it. The thing is, I asked for 17k pks per second, but I only got </span><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">12,795 pk/s </span><span style="font-family: inherit;">so obviously the server could not meet the implied schedule. If it could not meet the schedule, response time should be terrible. But it's not; it's very similar to the result we got above. Because? Because again, latency here means <b>service time </b>and not <b>response time</b>. While response time is not informative in an all out test, in an RUL benchmark it is the whole purpose of the benchmark. I have a schedule in mind, requests come at a particular rate, which implies they have a known start time (request n will start at: T0 + n/rate, T0 being the start of the test). This is coordinated omission, let's fix it.</span></div>
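<div>
The essence of the fix, sketched below with assumed names (this is not the actual Cassandra Stress code, see the commits linked in Part II): compute the intended start time from the schedule, and record the response time against it rather than against the moment the operation actually started.</div>
<div>
<pre>
// Sketch only: fixed rate schedule and HdrHistogram recording
long intendedStartNs = testStartNs + (opIndex * 1_000_000_000L) / opsPerSec;
long actualStartNs   = System.nanoTime(); // may be well after the intended start if we fell behind
doOperation();
long endNs = System.nanoTime();

responseTime.recordValue(endNs - intendedStartNs); // includes the time spent waiting for our turn
serviceTime.recordValue(endNs - actualStartNs);    // what the old 'latency' numbers reported
waitTime.recordValue(actualStartNs - intendedStartNs);
</pre>
</div>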
<div>
<span style="font-family: inherit;"><br /></span></div>
<h3>
Part II(a): It Puts The Measurements In The HdrHistogram or It Gets The Hose Again!</h3>
<div>
First off, I like to have <a href="https://github.com/HdrHistogram/HdrHistogram" target="_blank">HdrHistogram</a> files to work with when looking at latency data (<a href="http://psy-lob-saw.blogspot.com/2015/02/hdrhistogram-better-latency-capture.html" target="_blank">made a whole post about it too</a>). They are usually tons better than whatever it is people hand roll, and they come with a bunch of tooling that makes data post processing easy. Importantly, HdrHistograms are loss-less, compact and configurable, and support loss-less compressed logging. Combine that with the high resolution of data and you have a great basis for post run analysis.<br />
Cassandra Stress had its own sampling latency collection mechanism which would randomly drop some samples as a means to improve memory footprint, so the replacement improved reporting accuracy and reduced the amount of code. A side effect of this change is that Cassandra perf regression testing stability has improved rather dramatically. As indicated by this graph:</div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOiorrYxCs1YiXFBx6run5mOt50KcTcxrfA6C_pcmSYamKLtxr6pTds67RKztRZhfm6QNpYxMP0ldQ4NkQsXntQmGZr1rOVYEaWD1nTkk6ekIUNGBADeeHXdrYlwWLUfl73p1vsoo08vY/s1600/99.9th_latency.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="531" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOiorrYxCs1YiXFBx6run5mOt50KcTcxrfA6C_pcmSYamKLtxr6pTds67RKztRZhfm6QNpYxMP0ldQ4NkQsXntQmGZr1rOVYEaWD1nTkk6ekIUNGBADeeHXdrYlwWLUfl73p1vsoo08vY/s640/99.9th_latency.png" width="640" /></a>
<br />
<div>
<div class="separator" style="clear: both; text-align: left;">
Because of the random sampling 99.9%ile reporting was unstable before the patch went in (May 28th), but smoooooth ever since. Ain't that nice?</div>
<div class="separator" style="clear: both; text-align: left;">
Here's the <a href="https://github.com/nitsanw/cassandra/commit/bf6c04a8948fc91c4eac7a5e86a21c4469c24e65" target="_blank">commit with the main changes</a> on the branch that became the PR, there were further changes introduced to enable histogram logging to files. The change is mostly tedious, but importantly it introduces the notion of <a href="https://github.com/nitsanw/cassandra/commit/bf6c04a8948fc91c4eac7a5e86a21c4469c24e65#diff-70988bd41ddcbb6b9ffe75e1fac6df8eR31" target="_blank">split response/service/wait histograms</a>. The data is <a href="https://github.com/nitsanw/cassandra/commit/bf6c04a8948fc91c4eac7a5e86a21c4469c24e65#diff-70988bd41ddcbb6b9ffe75e1fac6df8eR64" target="_blank">recorded as follows</a>:</div>
<div>
<script src="https://gist.github.com/nitsanw/e766fc23cc2e9dc91866ac4af076e9dc.js"></script></div>
<div>
The question is, what is the intended start time?</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<h3 style="clear: both; text-align: left;">
Part II(b): <a href="https://www.youtube.com/watch?v=-4v1fkylL3s" target="_blank">Where Do I Start?</a></h3>
<div>
Load generators, when not going all out, generally have some notion of schedule. It is more often than not quite simply a fixed rate of requests, though the notion of schedule holds regardless of how you come up with it. A schedule in this context means that for any event X there is a time when it should happen: st(X). That time is easily computed in a fixed rate schedule as: "st(n) = T0 + n/rate". Cassandra Stress however was using google's RateLimiter to provide it with the scheduling, and while battle tested and handy it does not have a notion of schedule. The replacement took place in 2 steps.</div>
<div>
First I refactored the existing code into hiding the details of how operations are scheduled and where they come from behind a<a href="https://github.com/nitsanw/cassandra/commit/8fb6f89e33ea6ebddf08df833b679dc44356d359" target="_blank"> blocking queue like interface.</a> The <a href="https://github.com/nitsanw/cassandra/commit/7c02c5ef2dde423607097dff56ef5e4fb48bfb30" target="_blank">next step</a> was to support a fixed rate stream of operations where the intended schedule is available so I can use it. This is what I ended up with (<a href="https://github.com/nitsanw/cassandra/commit/9e6cb583e2e6aeeab29de15d895ada0bd479d06a" target="_blank">further tweaked</a> to only start the clock when all the consumer initialization has happened):</div>
<div>
<script src="https://gist.github.com/nitsanw/72df7c3fea2be7353f3b316cf5236c91.js"></script></div>
<div>
<br /></div>
<div>
Now we're all set to benchmark with no Coordinated Omission!<br />
<br />
<h3>
Part II(c): Who Reports What?</h3>
</div>
<div>
The measurement collection now all set up we face the question of what should different load generating modes report. Since it turned out that 'limit' was a confusing name (hard limit? upper limit?) it was decided to replace it with 2 distinct modes, adding to a total of 3 modes:</div>
<div>
<ul>
<li>Throughput: latency == service time, response/wait time not recorded. Maximum throughput is an important test mode for flushing out CPU bottlenecks and contention. It may help in exposing IO configuration issues as well. It is not however a good predictor of response time distribution as no attempt is made to estimate independent request timing (i.e uncoordinated measurement). The maximum throughput number is valuable to batch processing etc, but I'd caution against using it for the sizing of a responsive system. If a system has responsiveness SLAs it is better to use the 'fixed' test mode and run for significant periods to determine the load under which response times SLA can be met.</li>
<li> Throttle ("-rate throttle=1000/s"): latency == service time, response/wait time recorded like in fixed. This mode is a compromise for people who liked the old 'limit' measurement and want to measure "service time under attempted load". The partitions/second parameter is attempted for, but the summary does not reflect the response time. If you like, this is a way to sneakily record the response time while keeping the summary as it is so people are not so startled by the difference. I don't see myself using it, but have a blast.</li>
<li>Fixed ("-rate fixed=1000/s"): latency == response time, service/response/wait time recorded. This mode is for benchmarking response time under load. The partitions/second parameter is used to determine the fixed rate scheduled of the requests. The hdr log can be used to visualize latencies over time or aggregate latencies for any segment down to the logging frequency. The logs contain response/wait/ service components that can be extracted and handled separately.</li>
</ul>
</div>
In all of the above you can choose to record the HDR histograms to a file on a fixed interval. The interval is the one used for summary reporting. To enable histogram logging use: " -log interval=10s hdrfile=cs.hdr"<br />
Note that the interval should be quite long, and the default of 1s is not actually achievable by Cassandra Stress at the moment. This is down to the terrible way the reporting thread synchronizes with the load generation threads, and while it is on my wish list ("I wish I had time to fix it...") it was outside the scope of fixing CO, so it lived to taunt us another day. I've settled on 10 second intervals for now; you can have even longer ones, it all depends on the sort of granularity you want in your reports.<br />
<br />
<h3>
Part III: What Is It Good For?</h3>
</div>
<div>
So we got 3 load generating modes, and we've broken up latency into 3 components, the numbers are in histograms, the histograms are in the logs... let's have a little worked example.</div>
<div>
I run a Cassandra 2.1 node on my laptop (old software, bad hardware... I don't care. This is not about absolute numbers, it's about the measurement features). To run this please:</div>
<div>
<ol>
<li>Clone the Cassandra trunk: <span style="font-family: "courier new" , "courier" , monospace;">git clone https://github.com/apache/cassandra.git</span></li>
<li>Build Cassandra and Cassandra Stress: <span style="font-family: "courier new" , "courier" , monospace;">ant clean build stress-build jar</span></li>
</ol>
</div>
<div>
<br /></div>
<div>
I use the example workload provided by the good Cassandra folks, and start with an all out stress test from cassandra/tools/bin:</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">$ ./cassandra-stress user profile=../cqlstress-example.yaml duration=60s ops\(insert=1,simple1=9\) -mode native cql3 protocolVersion=2 -rate threads=3 -log interval=10s hdrfile=throughput.hdr</span></div>
<div>
<script src="https://gist.github.com/nitsanw/9e57e25c2fdddad575a44c0069079ac3.js"></script></div>
<div>
Joy, my laptop is giving me an awesome throughput of 13,803 pk/s. The summary itself is pretty informative for throughput runs, so what do we win with the HDR log?<br />
The log we produce can be summarized, manipulated and graphed by a bunch of utilities. Here's what the log looks like (this is from the fixed mode run):<br />
<script src="https://gist.github.com/nitsanw/994f6bb6ba0e683caad342140b8685a1.js"></script><br />
Note that while we logged our latency in nanoseconds, the max column is in milliseconds. The nanosecond level measurements are still available in the compressed histogram to the right. Sadly it's not very friendly on the eye. HdrHistogram does include a log processing utility, but it offers quite basic facilities. I've put together a few utilities for histogram log management in <a href="https://github.com/nitsanw/HdrLogProcessing" target="_blank">HdrLogProcessing</a>. These allow you to split, union and summarize logs and work with the tags feature. Let's make them into handy aliases:<br />
<br />
<ul>
<li><span style="font-family: "courier new" , "courier" , monospace;">hdrsum=java -jar HdrLogProcessing-1.0-SNAPSHOT-jar-with-dependencies.jar SummarizeHistogramLogs</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">hdrspl=</span><span style="font-family: "courier new" , "courier" , monospace;">java -jar HdrLogProcessing-1.0-SNAPSHOT-jar-with-dependencies.jar SplitHistogramLogs</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">hdruni=</span><span style="font-family: "courier new" , "courier" , monospace;">java -jar HdrLogProcessing-1.0-SNAPSHOT-jar-with-dependencies.jar UnionHistogramLogs</span></li>
</ul>
<br />
The <span style="font-family: "courier new" , "courier" , monospace;">throughput.hdr</span> file we requested uses a recently added HdrHistogram feature which allows for the tagging of different histograms in one log. This makes it easy for applications logging histograms for different measurements to shove them all into a single file rather than many. Since we want to track 2 different operations with their separate response/service/wait times (so up to 6 files) this is quite handy. We can start by using the all in one HdrLogProcessing summary utility to add up the histogram data. The default mode of summary is percentiles and will produce a summary very similar to the one produced above by Cassandra Stress (using the parameters "<span style="font-family: "courier new" , "courier" , monospace;">-if throughput.hdr -ovr 1000000</span>" to summarize in milliseconds):<br />
<script src="https://gist.github.com/nitsanw/d33a7da6564aef4c7ea90ec80f4e597c.js"></script><br />
We can have the microsecond or nanosecond report by tweaking the <span style="font-family: "courier new" , "courier" , monospace;">outputValueUnitRatio(ovr). </span><span style="font-family: inherit;">We are also free to ignore tags and create an all inclusive summary by specifying the </span><span style="font-family: "courier new" , "courier" , monospace;">ignoreTag(it)</span><span style="font-family: inherit;"> parameter. Used together to create a total summary in microseconds we get (</span>"<span style="font-family: "courier new" , "courier" , monospace;">-if throughput.hdr -ovr 1000 -it</span>" <span style="font-family: inherit;">):</span><br />
<span style="font-family: inherit;"><script src="https://gist.github.com/nitsanw/bd2e118498c40ba679ce58c76b18d868.js"></script></span><br />
As a bonus, the summary tool allows us to process multiple files into the same result (logs from multiple benchmark runs for instance) and to select periods within the logs to include.<br />
The summary tool also supports generating the HGRM format which allows us to produce the following graph (using "-if throughput.hdr -ovr 1000 -st hgrm -of tpt" and the <a href="http://hdrhistogram.github.io/HdrHistogram/plotFiles.html" target="_blank">google charts provided in HdrHistogram</a>):<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3ZWfj80oMvAxGfT1dObzDsrFMKSvwFGjqalmLUwHNnM6oAolkz7IKgp_boegVXLj3yEIG5MYxJu4u0d8shpvJNzdTJflX8s75KgSdfF-j0D08erhi1MmsU9U5piC4tQMor3yKvK_wzG8/s1600/Histogram+%25281%2529.png" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3ZWfj80oMvAxGfT1dObzDsrFMKSvwFGjqalmLUwHNnM6oAolkz7IKgp_boegVXLj3yEIG5MYxJu4u0d8shpvJNzdTJflX8s75KgSdfF-j0D08erhi1MmsU9U5piC4tQMor3yKvK_wzG8/s1600/Histogram+%25281%2529.png" /></a><br />
<br />
<br />
Now imagine we used 3 different machines to stress a single Cassandra node. Because the logs are additive and loss less there's no issue with using the union tool to aggregate them all into a single log and process it as such. Similarly you can use the splitting tool to split out the operation you are interested in and manipulate it in isolation. Indeed the sky is the limit.<br />
Now, with a stunning 13K partitions per second and a 50ms 99.99%ile, I might think it a reasonable idea to say my server can safely sustain a 10K pk/s rate and call it a day. I shall test this adventurous little estimate using the throttle mode:<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ ./cassandra-stress user profile=../cqlstress-example.yaml duration=60s ops\(insert=1,simple1=9\) -mode native cql3 protocolVersion=2 -rate <b>throttle</b>=10000/s threads=3 -log interval=10s hdrfile=throttle.hdr</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><script src="https://gist.github.com/nitsanw/db79d65215b2e735f3f7e8dfed7eaa66.js"></script></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
There, according to this, we should have no trouble at all with 10K pk/s. I mean sure it's a bit close to the SLA, but surely not too bad? The throttle mode is in this case just a way to keep your head in the sand a bit longer, but it does record the response time in case you feel like comparing them. Let's compare the service time histogram and the response time histogram for this run. To get both operations' histograms I need to play with the log processing a bit:<br />
<br />
<ol>
<li>Split the throttle.hdr file:"<span style="font-family: "courier new" , "courier" , monospace;">hdrspl -if throttle.hdr</span>" -> This results in 6 different files <span style="font-family: "courier new" , "courier" , monospace;"><op>-<rt/st/wt>.throttle.hdr</span>.</li>
<li>Summarize the service time histograms:"<span style="font-family: "courier new" , "courier" , monospace;">hdrsum -if .*.-st.throttle.hdr -ovr 1000 -it -st hgrm -of throttle-st</span>" -> we get <span style="font-family: "courier new" , "courier" , monospace;">throttle-st.hgrm</span></li>
<li>Summarize the response time histograms:"<span style="font-family: "courier new" , "courier" , monospace;">hdrsum -if .*.-rt.throttle.hdr -ovr 1000 -it -st hgrm -of throttle-rt</span>" -> we get <span style="font-family: "courier new" , "courier" , monospace;">throttle-rt.hgrm</span></li>
<li><span style="font-family: inherit;">Load them into the <a href="http://hdrhistogram.github.io/HdrHistogram/plotFiles.html" target="_blank">google charts provided in HdrHistogram</a>.</span></li>
</ol>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXFCkU-NEGDSov03-dn2PLnq29I56K2_IBFywSaMk2XcSaazbuhkP4mt7NdHHAcxPKZAwM9j60-bAD1MXHkkBrko0dQ82aB8gcvyMUfRwM9xSTjAjX0IjnjeaV3ebxu-66ntz7_jpCdYo/s1600/Histogram+%25282%2529.png" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXFCkU-NEGDSov03-dn2PLnq29I56K2_IBFywSaMk2XcSaazbuhkP4mt7NdHHAcxPKZAwM9j60-bAD1MXHkkBrko0dQ82aB8gcvyMUfRwM9xSTjAjX0IjnjeaV3ebxu-66ntz7_jpCdYo/s1600/Histogram+%25282%2529.png" /></a></div>
<div>
Take the red pill and you can keep your estimate, take the blue pill and you might keep your job.<br />
This is a fine demonstration of just how far from schedule operations can get without the service time measurement registering an issue. In other words the red line, service time, is the measurement with coordinated omission; the blue is the measurement which does not suffer from it.<br />
If you are finally convinced that you should be looking at response time instead of service time you can skip the <b>throttle</b> mode and move on straight to the <span style="font-family: "courier new" , "courier" , monospace;"><b>fixed</b></span> mode. The difference is only in what view of the world you want to see in your summary.<br />
Finally, you can take the histogram logs and graph the latencies over time using <a href="https://github.com/ennerf/HdrHistogramVisualizer" target="_blank">HdrHistogramVisualizer</a>. It currently does not handle tags well, so you'll need to split your logs first, but after that you can generate graphs as funky as this one (this is a longer run with fixed 10K rate, plotting the maximum latency for inserts at the top):<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ ./cassandra-stress user profile=../cqlstress-example.yaml duration=360s ops\(insert=1,simple1=9\) -mode native cql3 protocolVersion=2 -rate <b>fixed</b>=10000/s threads=3 -log interval=10s hdrfile=fixed.hdr</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ hdrspl -if fixed.hdr</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"></span><br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8HKeSygeOWNP91RahfitXl7T-DKf1bYbq90KEx-Y6LKe5e28zudtpR6l_54m-BflkqrgLUFt3hQOKbyj47vx41eaI3DgNGhlrDciCdPaBamaX1uEX11KqisISsEqAQtQbiLnhItohKrU/s1600/fixed-insert-max.png" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8HKeSygeOWNP91RahfitXl7T-DKf1bYbq90KEx-Y6LKe5e28zudtpR6l_54m-BflkqrgLUFt3hQOKbyj47vx41eaI3DgNGhlrDciCdPaBamaX1uEX11KqisISsEqAQtQbiLnhItohKrU/s1600/fixed-insert-max.png" /></a><br />
This tells an interesting story: it seems as if the server was coping alright with the load to begin with, but hit a bump (GC? compaction?) after 225 seconds, and an even larger bump slightly later. That's lovely to know. I'm not much of a UI wiz and I'm sure some readers out there can make some valuable PRs to this project ;-)<br />
<br />
<h3>
Summary: What's new?</h3>
</div>
<div>
Given that with recent versions of Cassandra Stress you can now test older versions of Cassandra as well (using the <span style="font-family: "courier new" , "courier" , monospace;">protocolVersion</span> parameter as demonstrated above), you can stop using your crummy old version of Cassandra Stress and build trunk from source to get the following benefits:</div>
<div>
<ul>
<li>HdrHistogram latency capturing and logging. Set the following options to get your very own histogram log: "<span style="font-family: "courier new" , "courier" , monospace;">-log interval=10s hdrfile=myawesomelog.hdr</span>".</li>
<li>Have a look at the HdrLogProcessing project for some handy utilities to help you slice and dice the data. They are pretty simple and you can feel free to build/contribute your own.</li>
<li>2 new load generation modes: <b>throttle</b> and <b>fixed</b> now replace the old <b>limit</b>. Use <b>fixed</b> to get a view on your cluster's response time under attempted constant load.</li>
</ul>
</div>
<div>
With HdrHistogram support now available in many languages, you can easily build post processing utilities for analysis in a language of your choice.<br />
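For example, here's a minimal sketch (an assumption for illustration only: the HdrHistogram Java library is on the classpath and the log contains plain interval histograms) of reading an interval log and printing a few percentiles from the accumulated data:<br />
<pre>
// A minimal sketch: read an HdrHistogram interval log and summarize it.
// Values are reported in whatever unit the writer recorded.
import java.io.File;
import org.HdrHistogram.EncodableHistogram;
import org.HdrHistogram.Histogram;
import org.HdrHistogram.HistogramLogReader;

public class LogSummary {
    public static void main(String[] args) throws Exception {
        HistogramLogReader reader = new HistogramLogReader(new File(args[0]));
        Histogram total = new Histogram(3); // 3 significant decimal digits, auto-resizing
        EncodableHistogram interval;
        while ((interval = reader.nextIntervalHistogram()) != null) {
            total.add((Histogram) interval); // accumulate all intervals into one histogram
        }
        System.out.println("max   = " + total.getMaxValue());
        System.out.println("p99   = " + total.getValueAtPercentile(99.0));
        System.out.println("p99.9 = " + total.getValueAtPercentile(99.9));
    }
}
</pre>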
Have fun storming the castle!<br />
<br />
Many many thanks to <a href="https://twitter.com/tjake" target="_blank">Jake Luciani</a> of the Apache Cassandra project for reviewing the PR, helping to shape it, and merging it. Jake is an Open Source hero, buy that man a drink on sight!<br />
Both Jake and <a href="https://twitter.com/chbatey" target="_blank">Chris Batey</a> reviewed this post, if any errors remain it is down to their disgustingly sloppy and unprofessional ways, please let me know and I shall have words with their parents.<br />
<br /></div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com1tag:blogger.com,1999:blog-5171098727364395242.post-26789666646962788002016-06-23T12:45:00.000+01:002018-07-10T14:28:26.464+01:00The Pros and Cons of AsyncGetCallTrace Profilers<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjsDR2GscFwL-nQITSst3lHF5lDXUuHrz_42GX76sDFXX2si-XVwVumI24L1Xy85rCs5aV1F2ZRb-evfPb8sAoAWiUWguZaQfDZVzz-lEBz309UQkj-y8m33jaBnd7mjh2jCAIhyz8KNEg/s1600/the-pros-and-cons-of-hitch-hiking-5088dafa7fc2f.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjsDR2GscFwL-nQITSst3lHF5lDXUuHrz_42GX76sDFXX2si-XVwVumI24L1Xy85rCs5aV1F2ZRb-evfPb8sAoAWiUWguZaQfDZVzz-lEBz309UQkj-y8m33jaBnd7mjh2jCAIhyz8KNEg/s200/the-pros-and-cons-of-hitch-hiking-5088dafa7fc2f.jpg" width="200" /></a>So, going on from my <a href="http://psy-lob-saw.blogspot.co.za/2016/02/why-most-sampling-java-profilers-are.html" target="_blank">whingy post on safepoint bias</a>, where does one go to get their profiling kicks? One option would be to use an OpenJDK internal API call AsyncGetCallTrace to facilitate non-safepoint collection of stack traces.<br />
AsyncGetCallTrace is NOT an official JVM API. That's not a comforting place to be for profiler writers, and it was originally only implemented on OpenJDK/Oracle JVMs (Zing has recently started supporting AGCT to enable support for Solaris Studio and other profilers, I will write a separate post on Studio). Its original use case was for Solaris Studio, and it provides the following API (see <a href="http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/share/vm/prims/forte.cpp" target="_blank">forte.cpp</a>, the name is a leftover from the Forte Analyzer days). Here's what the API adds up to:<br />
<script src="https://gist.github.com/nitsanw/dfe809a21869ccbd0e8d7b56e0079432.js"></script><br />
Simple right? You give it a <span style="font-family: "courier new" , "courier" , monospace;"><i>ucontext</i></span> and it fills in the trace with call frames (or it gets the hose again).<br />
<br />
<h3>
A Lightweight Honest Profiler</h3>
The 'async' in the name refers to the fact that AGCT is safe to call in a signal handler. This is mighty handy in a profiling API as it means you can implement your profiler as follows:<br />
<ol>
<li>In a JVMTI agent, register a signal handler for signal X.</li>
<li><span style="font-family: inherit;"><a href="http://man7.org/linux/man-pages/man2/setitimer.2.html" target="_blank">Setup a timer</a>, triggering the signal X at the desired sample frequency. Honest Profiler is using the<span style="font-family: inherit;"> <b style="font-size: 16px;">ITIMER_PROF </b><span style="font-size: 16px;">option, which means we'll get signalled based on CPU time. The signal will be sent to the process and one of the running threads will end up being interrupted and calling into our signal handler. Note that this assumes the OS will distribute the signals fairly between threads so we will get a fair sample of all running threads.</span></span></span></li>
<li>From signal handler, call AGCT: Note that the interrupted thread (picked at 'random' from the threads currently executing on CPU) is now running your signal handler. The thread is NOT AT A SAFEPOINT. It may not be a Java thread at all.</li>
<li>Persist the call trace: Note that when running in a signal handler only 'async' code is legal to run. This means for instance that any blocking code is forbidden, including malloc and IO.</li>
<li>WIN!</li>
</ol>
The <span style="font-family: "courier new" , "courier" , monospace;"><i>ucontext</i></span> ingredient is the very same context handed to you by the signal handler (your signal handler is a callback of the signature <i><span style="font-family: "courier new" , "courier" , monospace;">handle(int signum, siginfo_t *info, void *context)</span></i>). From it AGCT will dig up the instruction/frame/stack pointer values at the time of the interrupt and do its best to find out where the hell you landed.<br />
This exact approach was followed by Jeremy Manson, who explained the infrastructure and <a href="https://code.google.com/archive/p/lightweight-java-profiler/" target="_blank">open sourced a basic profiler</a> (in a proof of concept, non-committal sort of way). His great series of posts on the matter:<br />
<ul>
<li><a href="http://jeremymanson.blogspot.co.za/2007/05/profiling-with-jvmtijvmpi-sigprof-and.html" target="_blank">Profiling with JVMTI/JVMPI, SIGPROF and AsyncGetCallTrace</a></li>
<li><a href="http://jeremymanson.blogspot.co.za/2007/06/more-about-profiling-with-sigprof.html" target="_blank">More about profiling with SIGPROF</a></li>
<li><a href="http://jeremymanson.blogspot.co.za/2007/06/more-thoughts-on-sigprof-jvmti-and.html" target="_blank">More thoughts on SIGPROF, JVMTI and stack traces</a></li>
<li><a href="http://jeremymanson.blogspot.co.za/2010/07/why-many-profilers-have-serious.html" target="_blank">Why Many Profilers have Serious Problems (More on Profiling with Signals)</a></li>
</ul>
The same code was then picked up by Richard Warburton and further improved and stabilized in <a href="https://github.com/RichardWarburton/honest-profiler" target="_blank">Honest-Profiler</a> (to which I have made some contributions). Honest Profiler is an effort to make that initial prototype production ready, and complement it with some tooling to make the whole thing usable. The serialization in a signal handler issue is resolved by using a lock-free MPSC ring-buffer of pre-allocated call trace structs (pointing to preallocated call frame arrays). A post-processing thread then reads the call traces, collects some extra info (like converting BCI to line number, jmethodIds into class names/file names etc) and writes to a log file. See the project's wiki for more details.<br />
The log file is parsed offline (i.e. by some other process, on another machine, at some other time. Note that you can process the file while the JVM is still profiling, you need not wait for it to terminate.) to give you the hot methods/call tree breakdown. I'm going to use Honest-Profiler in this post, so if you want to try this out at home you're going to have to go and <a href="https://github.com/RichardWarburton/honest-profiler/wiki/How-to-build" target="_blank">build it</a> to experiment locally (works on <a href="http://openjdk.java.net/" target="_blank">OpenJDK</a>/<a href="https://www.azul.com/products/zulu/" target="_blank">Zulu</a> 6/7/8 + <a href="http://www.oracle.com/technetwork/java/javase/downloads/index.html" target="_blank">Oracle JVM</a>s + recent <a href="https://www.azul.com/products/zing/" target="_blank">Zing</a> builds). I'll do some comparisons for the same experiments with JMC, you'll need an Oracle JVM (1.7u40 and later) to try that out.<br />
<br />
<h3>
What Does AGCT Do?</h3>
<div>
Going by the API, we can say AGCT is a mapping between instruction/frame/stack pointer and a call trace. The call trace is an array of Java call frames (jmethodId, BCI). To produce this mapping the following process is followed:</div>
<div>
<ol>
<li>Make sure thread is in 'walkable' state, in particular not when:</li>
<ul>
<li>Thread is not a Java thread.</li>
<li>GC is active</li>
<li>New/uninitialized/just about to die. I.e. threads that are either before or after having Java code running on them are of no interest.</li>
<li>During a deopt</li>
</ul>
<li>Find the current/last Java frame (as in actual frame on the stack, revisit Operating Systems 101 for definitions of stacks and frames):</li>
<ul>
<li>The instruction pointer (commonly referred to as the PC - Program Counter) is used to look up a matching Java method (compiled/interpreter). The current PC is provided by the signal context.</li>
<li>If the PC is not in a Java method we need to find the last Java method calling into native code.</li>
<li>Failure is an option! We may be in an 'unwalkable' frame for all sorts of reasons... This is quite complex and if you must know I urge you to get comfy and dive into the maze of relevant code. Trying to qualify the top frame is where most of the complexity is for AGCT.</li>
</ul>
<li>Once we have a top frame we can fill the call trace. To do this we must convert the real frame and PC into:</li>
<ul>
<li>Compiled call frames: The PC landed on a compiled method, find the BCI (Byte Code Index) and record it and the jMethodId</li>
<li>Virtual call frames: The PC landed on an instruction from a compiled inlined method, record the methods/BCIs all the way up to the framing compiled method</li>
<li>Interpreted call frames</li>
<li>From a compiled/interpreted method we need to walk to the calling frame and repeat until we make it to the root of the Java call trace (or record enough call frames, whichever comes first)</li>
</ul>
<li>WIN!</li>
</ol>
Much like a medication's list of potential side effects, the error code list supported by a function can be very telling. AGCT supports the following reasons for not returning a call trace:<br />
<script src="https://gist.github.com/nitsanw/8a40494c54b3854aca2f62b20522408b.js"></script><br />
While this data is reported from AGCT, it is often missing from the reports based on it. More on that later.<br />
<br /></div>
<h3>
The Good News!</h3>
<div>
So, looking at the bright side, we can now see between safepoint polls!!! How awesome is that? Let's see exactly how awesome by running the benchmarks which we could not measure correctly with the safepoint biased profiler in the previous post.<br />
[<i>Note: Honest-Profiler reports (t X,s Y) which reflect the total % of stack trace samples containing this method+line vs. the self % of samples in which this method+line is the leaf. The output is sorted by self.</i>]:</div>
<div>
<script src="https://gist.github.com/nitsanw/2518888fa21ec6a891490c072b30c082.js"></script></div>
<div>
<br />
Sweet! Honest Profiler is naming the right methods as the hot methods, and nailing the right lines where the work is happening.</div>
<div>
Let's try the copy benchmark, this time comparing the Honest-Profiler result with the JMC result:</div>
<div>
<script src="https://gist.github.com/nitsanw/6ba4d4d6dd8b26c3a8379945268bf0dd.js"></script></div>
<div>
Note the difference in reporting switching on the "-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints" flags makes to JMC. Honest-Profiler now (<a href="https://github.com/RichardWarburton/honest-profiler/pull/133" target="_blank">recent change</a>, hot off the press) causes the flag to be on without requiring user intervention, but JMC does not currently do that (this is understandable when attaching after the process started, but should perhaps be default if the process is started with recording on).</div>
<div>
Further experimentation and validation exercises for the young and enthusiastic:</div>
<div>
<ul>
<li>Compare the Honest-Profiler results with JFR/JMC results on larger applications. What's different? Why do you think it's different?</li>
<li>Run with -XX:+PrintGCApplicationStoppedTime to see no extra safepoints are caused by the profiling (test with profiler on/off).</li>
<li>The extra-studious of y'all can profile the other cases discussed in previous post to see they similarly get resolved.</li>
</ul>
</div>
<div>
While this is a step forward, we still got some issues here... </div>
<h3>
Collection errors: Runtime Stubs</h3>
<div>
As we can see from the list of errors, any number of events can result in a failed sample. A particularly nasty type of failed sample is when you hit a runtime generated routine which AGCT fails to climb out of. Let's see what happens when we don't roll our own array copy:<br />
<div>
<script src="https://gist.github.com/nitsanw/4696f30d663d8601a3fffd3b0d8938e5.js"></script>
<br />
<div>
<br />
Now, this great unknown marker is a recent addition to Honest-Profiler, which increases its honesty about failed samples. It indicates that for 62.9% of all samples, AGCT could not figure out what was happening and returned <i><span style="font-family: "courier new" , "courier" , monospace;">ticks_unknown_java</span></i>. Given that there's very little code under test here we can deduce the missing samples all fall within System.arraycopy (you can pick a larger array size to further prove the point, or use a native profiler for comparison).</div>
<div>
Profiling the same benchmarks under JMC will not report failed samples, and the profiler will divide the remaining samples as if the failed samples never happened. Here's the JMC profile for systemArrayCopy:</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">Number of samples(over 60 seconds) : 2617</span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">Method::Line Samples %</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">systemArrayCopy_avgt_jmhStub(...) :: 165 1,729 66.043</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">systemArrayCopy() :: 41 331 12.643</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">systemArrayCopy() :: 42 208 7.945</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">systemArrayCopy_avgt_jmhStub(...) :: 166 93 3.552</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">systemArrayCopy_avgt_jmhStub(...) :: 163 88 3.361</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: inherit;">JMC is reporting a low number of samples (the total is sort of available in the tree view as the number of samples in the root), but without knowing what the expected number of samples should be this is very easy to miss. This is particularly true for larger and noisier samples from real applications collected over longer period of time.</span></div>
<div>
<span style="font-family: inherit;">Is this phenomena isolated to System.arrayCopy? Not at all, here's crc32 and adler32 as a further comparison:</span></div>
<script src="https://gist.github.com/nitsanw/a96f79921fe2248ee8d0d10292abaece.js"></script>
<br />
<div>
<br /></div>
<div>
<span style="font-family: inherit;">What do Crc32 and System.arrayCopy have in common? They are both JVM intrinsics, replacing a method call (Java/native don't really matter, though in this case both are native) with a combination of inlined code and a call to a JVM runtime generated method. This method call is not guarded the same way a normal call into native methods is and thus the AGCT stack walking is broken.</span><br />
<span style="font-family: inherit;">Why is this important? The reason these methods are worth making into intrinsics is because they are sufficiently important bottlenecks in common applications. Intrinsics are like tombstones for past performance issues in this sense, and while the result is faster than before, the cost is unlikely to be completely gone.</span><br />
<span style="font-family: inherit;">Why did I pick CRC32? because I recently spent some time profiling Cassandra. Version 2 of Cassandra uses adler32 for checksum, while version 3 uses crc32. As we can see from the results above, it's potentially a good choice, but if you were to profile Cassandra 3 it would look better than it actually is because all the checksum samples are unprofilable. Profiling with a native profiler will confirm that the checksum cost is still a prominent element of the profile (of the particular setup/benchmark I was running).</span></div>
<div>
<span style="font-family: inherit;"><b>AGCT profilers are blind to runtime stubs (some or all, this may get fixed in future releases...). Failed samples are an indication of such a blind spot.</b></span><br />
<br />
Exercise to the reader:<br />
<br />
<ul>
<li>Construct a benchmark with heavy GC activity and profile (a minimal allocation-heavy sketch follows this list). The CPU spent on GC will be absent from your JMC profile, and should show as <span style="font-family: "courier new" , "courier" , monospace; font-style: italic;">ticks_GC_active </span><span style="font-family: inherit;">in your Honest-Profiler profile.</span></li>
<li><span style="font-family: inherit;">Construct a benchmark with heavy compiler activity and profile. As above look for compilation CPU. There's no compilation error code, but you should see allot of </span><span style="font-family: "courier new" , "courier" , monospace;"><i>ticks_unknown_not_Java</i></span> which indicate a non-Java thread has been interrupted (<a href="https://github.com/RichardWarburton/honest-profiler/issues/138" target="_blank">this is a conflated error code, we'll be fixing it soon</a>).</li>
<li>Extra points! Construct a benchmark which is spending significant time deoptimising and look for the <i><span style="font-family: "courier new" , "courier" , monospace;">ticks_deopt</span></i> error in your Honest-Profiler profile.</li>
</ul>
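As a starting point for the first exercise, here's a minimal sketch (my own hypothetical example, not code from the post) of an allocation-heavy JMH benchmark that should generate plenty of young GC activity:<br />
<pre>
// A hypothetical allocation-heavy benchmark: each call produces roughly 1MB of short lived garbage.
// Run it under your profiler of choice and see where (or whether) the GC CPU time shows up.
import org.openjdk.jmh.annotations.Benchmark;

public class GcHeavyBenchmark {
    @Benchmark
    public byte[][] allocateLots() {
        byte[][] garbage = new byte[1024][];
        for (int i = 0; i < garbage.length; i++) {
            garbage[i] = new byte[1024]; // small short lived arrays, plenty of work for the young GC
        }
        return garbage; // returned so JMH consumes it and dead code elimination doesn't kick in
    }
}
</pre>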
<br />
<h3>
<b style="font-family: inherit;">Blind Spot: Sleeping code</b></h3>
</div>
<div>
Since sleeping code doesn't actually consume CPU it will not appear in the profile. This can be confusing if you are used to the JVisualVM style reporting where all stacks are reported including WAITING (or sometimes RUNNING Unsafe.park). Profiling sleeping or blocked code is not something these profilers cover. I mention this not because they promise to do this and somehow fail to deliver, but because it's a common pitfall. For Honest-Profiler <a href="https://github.com/RichardWarburton/honest-profiler/issues/126" target="_blank">this feature</a> should help highlight mostly off-CPU applications as such by comparing expected and actual samples (the delta is signals which were not delivered to the process as it was not running).</div>
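To make the pitfall concrete, here's a toy example (my own, not taken from any of the profiler projects): this code spends almost all of its wall-clock time blocked, so a CPU-time sampling profiler will attribute next to nothing to it, while an all-threads stack dump would show the thread as WAITING most of the time:<br />
<pre>
// A toy example of a mostly off-CPU workload: nearly invisible to CPU-time sampling profilers.
public class MostlyAsleep {
    public static void main(String[] args) throws InterruptedException {
        double sink = 0;
        while (true) {
            Thread.sleep(100);        // off-CPU: no samples will land here
            for (int i = 0; i < 1000; i++) {
                sink += Math.sqrt(i); // the only on-CPU work, and all a CPU profiler would see
            }
        }
    }
}
</pre>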
<div>
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;"></span><br />
<h3>
<span style="font-family: inherit;">Error Margin: Skidding + inlining</span></h3>
</div>
<div>
<span style="font-family: inherit;">I'll dive deeper into this one another time, since other profilers share this issue. In essence the problem is that instruction profiling is not accurate and we can often encounter a skid effect when sampling the PC(program counter). This in effect means the instruction reported can be a number of instructions after the instruction where the time is spent. Since AsyncGetCallTrace relies on PC sampling to resolve the method and code line we get the same inaccuracy reflected through the layer of mapping (PC -> BCI -> LOC) but where on the assembly level we deal with a single blob of instructions where the proximity is meaningful, in converting back to Java we have perhaps slipped from one inlined method to the next and the culprit is no longer in a line of code nearby.</span></div>
<div>
<span style="font-family: inherit;">To be continued...</span><br />
</div>
<h3>
Summary: What is it good for?</h3>
<div>
AsyncGetCallTrace is a step up from GetStackTraces as it operates at lower overheads and does not suffer from safepoint bias. It does require some mental model adjustments:</div>
<div>
<ol>
<li>-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints : if this is not on you are still hobbled by the resolution of the debug data. This is different from safepoint bias, since you actually can sample anywhere, but the translation from PC to BCI kills you. I've not seen these flags result in measurable overhead, so I'm not convinced the default value is right here, but for now this is how it is. Honest-Profiler takes care of this for you, with JMC you'll need to add it to your command line.</li>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDSCtHE_E73oWqcsAVfX0iuhIvVbAqC9q9tYyPTx4bJ3lZBG66L9-ATLxrexE7_JRx_uOmfMRItTmsrEvihvWtkFSVpB0MFfKcshISaXod5HCYWlvb-GxXo-WD0AWKWpbARASr_raZm3M/s1600/the_good__the_bad_and_the_ugly_by_felonland-d46b6az.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDSCtHE_E73oWqcsAVfX0iuhIvVbAqC9q9tYyPTx4bJ3lZBG66L9-ATLxrexE7_JRx_uOmfMRItTmsrEvihvWtkFSVpB0MFfKcshISaXod5HCYWlvb-GxXo-WD0AWKWpbARASr_raZm3M/s400/the_good__the_bad_and_the_ugly_by_felonland-d46b6az.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">But seriously guys, we should be using Fortran for this</td></tr>
</tbody></table>
<li>Each sample is for a <b>single, on CPU, thread:</b> This is very different from the GetStackTraces approach which sampled all threads. It means you are getting fewer traces per sample, and are completely blind to sleeping/starved threads. Because the overhead is that much lower you can compensate by sampling more frequently, or over longer periods of time. <b>This is a good thing</b>, sampling all the threads at every sample is a very problematic proposition given the number of threads can be very large.</li>
</ol>
<div>
AsyncGetCallTrace is great for profiling most 'normal' Java code, where hotspots are in your Java code, or rather in the assembly code they result in. This seems to hold in the face of most optimizations to reasonable accuracy (but may on occasion be off by quite a bit...).
</div>
<br />
AsyncGetCallTrace is of limited use when:</div>
<div>
<ol>
<li>Large numbers of samples fail: This can mean the application is spending its time in GC/Deopts/Runtime code. Watch for failures. I think currently honest-profiler offers better visibility on this, but I'm sure the good folks of JMC can take a hint.</li>
<li>Performance issue is hard to glean from the Java code. E.g. see the <a href="http://psy-lob-saw.blogspot.com/2015/07/jmh-perfasm.html" target="_blank">issue discussed in a previous post using JMH perfasm</a> (false sharing on the class id in an object header making the conditional inlining of an interface call very expensive). </li>
<li>Due to instruction skid/compilation/available debug information the wrong Java line is blamed. This is potentially very confusing in the presence of inlining and code motion.</li>
</ol>
</div>
<div>
Using Honest-Profiler you can now profile Open/Oracle JDK6/7/8 applications on Linux and OS X. You can also use it to profile Zing applications on recent Zing builds (version 15.05.0.0 and later, all JDKs). Honest-Profiler is lovely, but I would caution readers that it is not in wide use, may contain bugs, and should be used with care. It's a useful tool, but I'm not sure I'd unleash it on my production systems just yet ;-).<br />
JMC/JFR is available on Oracle JVMs only from JDK7u40, but works on Linux, OS X, Windows and Solaris (JFR only). JMC/JFR is free for development purposes, but requires a licence to use in production. Note that JFR collects a wealth of performance data which is beyond the scope of this post, I wholeheartedly recommend you give it a go.<br />
<br />
A big thank you to all the kind reviewers: <a href="https://twitter.com/jpbempel" target="_blank">JP Bempel</a>, <a href="https://twitter.com/switchology" target="_blank">Doug Lawrie</a>, <a href="https://twitter.com/hirt" target="_blank">Marcus Hirt</a> and <a href="https://twitter.com/RichardWarburto" target="_blank">Richard Warburton</a>, any remaining errors will be deducted from their bonuses directly.</div>
</div>
</div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com2tag:blogger.com,1999:blog-5171098727364395242.post-67945332562227546952016-03-21T10:16:00.001+00:002016-03-22T07:32:08.096+00:00GC 'Nepotism' And Linked QueuesI've just run into this issue this week, and it's very cute, so this here is a summary. Akka has their own MPSC linked queue implementation, and this week it was suggested they'd swap to using JCTools. The context in which the suggestion was made was a recently closed bug with the mystery title: <a href="https://github.com/akka/akka/issues/19216" target="_blank">AbstractNodeQueue suffers from nepotism</a><br />
An Akka user claimed to not be suffering from the bug because they were using JCTools, but as it turned out the issue was indeed present so I had to fix it etc... But what the hell is GC nepotism? You won't find it in your GC textbooks, but it's a catchy term for a very real problem.<br />
<br />
<h3>
Your Dead Old Relatives Can Promote You!</h3>
<div>
In the OpenJDK generational GCs there are distinct generations, Old and New. The 2 generations get collected separately, but are not independent from each other. In particular, references from NewGen to OldGen will keep OldGen references alive through an OldGen collection cycle, and similarly references from OldGen to NewGen will keep objects alive through a NewGen collection. This is all quite straight forward, but has a surprising implication:</div>
<blockquote class="tr_bq">
Dead OldGen objects will keep dead NewGen objects they reference alive until collected by an OldGen.</blockquote>
The reverse is also true, but since the whole notion of generational GC is based on the assumption that the new gen is collected far more often than the old gen the above use case is the real issue. How does this affect you? Is this even a real problem?<br />
Turns out it's real enough to have convinced Doug Lea to fix a particular manifestation of it, but also subtle enough to have slipped past him (and me, and the Akka guys, and Martin Thompson, and others no doubt) in the first place.<br />
<br />
<h3>
Linked Queues: Pathological Nepotism</h3>
<div>
Consider the following LinkedQueue implementation:</div>
<div>
<script src="https://gist.github.com/nitsanw/258b851242d9c7fd6665.js"></script></div>
<div>
The following invariants are maintained:</div>
<div>
<ul>
<li>head == tail <=> queue is empty</li>
<li>head.next == null</li>
<li>tail.value == null</li>
<li>In summary: queue is empty <=> (head == tail && tail.next == null && tail.value == null)</li>
</ul>
</div>
<div>
The nodes are created and linked together on offer, and dereferenced as we poll. The poll method MUST clear the value because the queue retains the last node. If we were to let the last node keep the value we would have effectively leaked this value reference.</div>
<div>
Ignoring the values, imagine the queue is part of the live set and we have enqueued a few elements, the arrow denotes what next references:<br />
<span style="font-family: "courier new" , "courier" , monospace;">head -> N3 -> null</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">tail -> N0 -> N1 -> N2 -> N3</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">If we now dequeue the elements we'll get back to empty, brackets denote live referenced from dead:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">head -> N3 -> null</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">tail -> N3 -> null</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">DEAD: </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;">N0 -> N1 -> N2 </span><span style="font-family: "courier new" , "courier" , monospace;">(-> N3)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: inherit;">Dead objects are traditionally considered the GC problem, and <u>cleaning up about to be collected objects is considered an anti-pattern</u>. After all, what is the point of cleaning up garbage? you are just duplicating work. Indeed in a single generation collector this is a non-issue.</span></div>
<div>
<span style="font-family: inherit;">But consider for a moment the possibility that one of the above nodes was promoted into the OldGen. This is something which may not happen in short lived tests, or even 'lucky' long lived tests, but is obviously quite possible. We will have:</span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">head -> N3 -> null</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">tail -> N3 -> null</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Zombie NEW: </span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;">N1 -> N2 (-> N3)</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">DEAD OLD: N0 (-> </span><span style="font-family: "courier new" , "courier" , monospace;">N1 -> N2 (-> N3)</span><span style="font-family: "courier new" , "courier" , monospace;">)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: inherit;">The young are alive, as far as the old are concerned. The old are dead, but their promotion gives life to their 'children' until they get collected. To demonstrate this issue I've written the following program: </span></div>
<div>
<script src="https://gist.github.com/nitsanw/fe9d5db85d035ecb331b.js"></script></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;">If you look at the GC logs, or plug in Visual VM with the Visual GC plugin you will notice the young GC is entirely ineffective. Everything gets promoted to old, until an old GC clears the lot. After the old GC the problem disappears (in the test program, but is likely to repeat in real programs). This is 'pathological' because ANY promoted node will result in the promotion of ALL following nodes until a GC resolves the issue. </span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">At this point you might be shaking your head sadly and saying: "Ye thick old fart, surely this is a silly issue which NEVER happens in the real world". I've had my share of heads being shook at me and am not offended, but consider that this issue is real enough for other people:</span><br />
<br />
<ul>
<li><a href="https://github.com/akka/akka/issues/17547" target="_blank">AbstractNodeQueue potential GC leak - many objects get tenured if we have "bursts" of messages</a></li>
<li><a href="https://bugs.openjdk.java.net/browse/JDK-6806875" target="_blank">Young memory "leak" in LinkedBlockingQueue</a> : see also <a href="http://tech.puredanger.com/2009/02/11/linkedblockingqueue-garbagecollection/" target="_blank">this post</a> on the same with further background. See relevant comment on <a href="http://cs.oswego.edu/pipermail/concurrency-interest/2009-February/005836.html" target="_blank">concurrency-interest from Doug Lea</a>, I find the comment "it may be that the effects didn't show up on the tests, machines, VMs I was using. But these are subject to transient changes" in particular reflects the difficulty is exposing the issue in synthetic testing.</li>
<li>This <a href="https://www.youtube.com/watch?v=M9o1LVfGp2A&feature=youtu.be&t=1382" target="_blank">war story from Tony Printezis</a> referring to this as a recurring pattern in Java services (I wholeheartedly recommend you watch Tony's presentation from beginning to end)</li>
</ul>
<h3>
An Exercise To the Reader</h3>
<br />
<span style="font-family: inherit;">My plan was to walk through the way this issue has been fixed in JCTools vs Akka vs Agrona. Each solution reflecting different considerations, but sadly I'm quite short on time for a tidy summary and will give my observations which you can choose to read up on and argue with.</span><br />
<span style="font-family: inherit;">All three implementations follow the algorithm outlined by <a href="http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue" target="_blank">D. Vyukov</a>. Note that Agrona is ignoring the Queue interface requirement for poll(return null iff queue is empty).</span><br />
<br />
<ul>
<li>In Akka the next reference is nulled on poll. This solves the issue, but at the cost of further weakening the accuracy of the size() method (a simplified sketch of this approach follows the list).</li>
<li>In JCTools the next reference is set to point to the discarded node. This helps in telling the difference between a removed node and the head. This piece of information is used in the size method to skip gaps in the size counted chain when racing with the consumer thread. The result is a more accurate size estimate. The change does come at a cost to the size method however as we are no longer able to rely on the chain remaining whole throughout the size operation.</li>
<li>In Agrona, the next reference is nulled as in Akka. A further effort is made to reuse a special empty node to avoid repeating promotion of transient nodes in empty queues. The effort is arguably worth while on systems with large numbers of active but mostly empty queues.</li>
</ul>
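To make the nulling approach concrete, here's a minimal sketch (my own simplification of the Vyukov algorithm, not the actual Akka/JCTools/Agrona code) where poll() both clears the value and breaks the link out of the consumed node, so a promoted dead node cannot drag live nodes into the old gen:<br />
<pre>
// A simplified, nepotism-aware MPSC linked queue sketch (illustration only).
// 'producerNode' plays the role of 'head' in the diagrams above, 'consumerNode' the role of 'tail'.
import java.util.concurrent.atomic.AtomicReference;

public class NepotismFreeMpscQueue {
    static final class Node {
        volatile Object value;
        volatile Node next;
        Node(Object value) { this.value = value; }
    }

    private final AtomicReference producerNode;
    private Node consumerNode;

    public NepotismFreeMpscQueue() {
        Node stub = new Node(null);
        producerNode = new AtomicReference(stub);
        consumerNode = stub;
    }

    public boolean offer(Object e) {
        Node newNode = new Node(e);
        Node prev = (Node) producerNode.getAndSet(newNode); // producers serialize here
        prev.next = newNode;                                // publish the new element
        return true;
    }

    public Object poll() {
        Node next = consumerNode.next;
        if (next == null) {
            return null;              // empty (or a producer has not linked its node yet)
        }
        Object value = next.value;
        next.value = null;            // don't leak the value via the retained node
        consumerNode.next = null;     // break the dead-to-live link: the nepotism fix
        consumerNode = next;
        return value;
    }
}
</pre>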
<br /></div>
<h3>
Open questions/discussion:</h3>
<div>
<ul>
<li>How common is this issue for 'classic' data structures? It seems reasonable that a host of queues/stacks/tree data structures are similarly vulnerable. A scan of JDK sources suggests this is a known pitfall to them, but I would bet many third party framework writers didn't get the memo.</li>
<li>How common is this issue for idiomatic 'user' Java code? given the tendency towards composition rather than inheritance and the drive towards smaller objects, we may argue that user object graphs often approximate trees. If we have short-lived 'trees' a parent node can certainly drag a useless chunk of state to the old gen. If a reference bridges 2 trees we may have something approximating the pathological behaviour above. I'm not aware of any studies trying to determine the impact of nepotism on user code, part of the problem would be the definition of 'typical' user code.</li>
<li>Where does the term 'nepotism' come from? I made an attempt at tracing the origins, but got nowhere. The problem seems to be documented at least 8 years ago, but the term nepotism is not used in most of the writing on it. [UPDATE: it seems the term can be tracked down to a GC paper published in 1991: <a href="https://www.researchgate.net/publication/220404180_An_Adaptive_Tenuring_Policy_for_Generation_Scavengers" target="_blank">An Adaptive Tenuring Policy for Generation Scavengers</a>]</li>
</ul>
</div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com16tag:blogger.com,1999:blog-5171098727364395242.post-7086702037146446692016-02-24T07:07:00.000+00:002016-06-30T13:21:44.867+01:00Why (Most) Sampling Java Profilers Are Fucking Terrible <div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwdyc6zQY04Eq14p-Pd2ihFTWlF9-mFXdX22lmvZ4UCWDdeK3TCOhVhk1KjFf432dOFCIpaFPnXn4YMDhT0IGNe4bW2S72fwGgAwjdHC7qyJrJXaFRyEzUZlZVigXNdbmdIgPIuJnqBLc/s1600/sad-panda.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwdyc6zQY04Eq14p-Pd2ihFTWlF9-mFXdX22lmvZ4UCWDdeK3TCOhVhk1KjFf432dOFCIpaFPnXn4YMDhT0IGNe4bW2S72fwGgAwjdHC7qyJrJXaFRyEzUZlZVigXNdbmdIgPIuJnqBLc/s1600/sad-panda.jpg" /></a></div>
This post builds on the basis of a <a href="http://psy-lob-saw.blogspot.com/2015/12/safepoints.html" target="_blank">previous post on safepoints</a>. If you've not read it you might feel lost and confused. If you have read it, and still feel lost and confused, and you are certain this feeling is related to the matter at hand (as opposed to an existential crisis), please ask away.<br />
So, now that we've established what safepoints are, and that:<br />
<ol>
<li>Safepoint polls are dispersed at fairly arbitrary points (depending on execution mode, mostly at uncounted loop back edge or method return/entry).</li>
<li>Bringing the JVM to a global safepoint is high cost</li>
</ol>
We have all the information we need to conclude that profiling by sampling at a safepoint is perhaps a bit shite. This will not come as a surprise to some, but this issue is present in the most commonly used profilers. According to this <a href="http://pages.zeroturnaround.com/RebelLabs---All-Report-Landers_Developer-Productivity-Report-2015.html?utm_source=performance-survey-report&utm_medium=allreports&utm_campaign=rebellabs&utm_rebellabsid=97" target="_blank">survey</a> by RebelLabs this is the breakdown:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKApO4NZjNtutcp7Uuy8zd377UYTloJAnGWognTBbND-pM_UGYvHezVD7ces7Fmyyx3tpg6O0BTxSTn13U38Lkb8QlSE29tvbEknCYzZWdBy188jo0FHwYoP0t1veBCDgniV9hzDNZx2U/s1600/which-profiler.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKApO4NZjNtutcp7Uuy8zd377UYTloJAnGWognTBbND-pM_UGYvHezVD7ces7Fmyyx3tpg6O0BTxSTn13U38Lkb8QlSE29tvbEknCYzZWdBy188jo0FHwYoP0t1veBCDgniV9hzDNZx2U/s1600/which-profiler.png" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
VisualVM, NB Profiler (same thing), YourKit and JProfiler all provide a sampling CPU profiler which samples at a safepoint. Seeing how this is a rather common issue, let's dig into it.<br />
<br />
<h3>
How Sampling Execution Profilers Work (in theory)</h3>
<div>
Sampling profilers are supposed to approximate the 'time spent' distribution in our application by collecting samples of where our application is at different points in time. The data collected at each sample could be:</div>
<div>
<ul>
<li>current instruction</li>
<li>current line of code</li>
<li>current method</li>
<li>current stack trace</li>
</ul>
The data can be collected for a single thread or all threads at each sample. What do we need to hold for sampling to work?<br />
<blockquote class="tr_bq">
"However, for sampling
to produce results that are comparable to a full (unsampled) profile,
the following two conditions must hold.
<b>First, we must have a large number of samples to get statistically
significant results.</b> For example, if a profiler collects only a single
sample in the entire program run, the profiler will assign 100% of
the program execution time to the code in which it took its sample
and 0% to everything else. [...]<br />
<b>Second, the profiler should sample all points in a program run
with equal probability.</b> If a profiler does not do so, it will end up with bias in its profile. For example, let’s suppose our profiler can
only sample methods that contain calls. This profiler will attribute
no execution time to methods that do not contain calls even though
they may account for much of the program’s execution time." - from <a href="http://plv.colorado.edu/papers/mytkowicz-pldi10.pdf" target="_blank">Evaluating the Accuracy of Java Profilers</a>, we'll get back to this article in a bit</blockquote>
</div>
<div>
That sounds easy enough, right?</div>
<div>
Once we have lots of samples we can construct a list of hot methods, or even code lines in those methods (if the samples report it), we can look at the samples distributed on the call tree (if call traces are collected) and have an awesome time!</div>
<div>
<br /></div>
<h3>
How Do Generic Commercial Java Sampling Execution Profilers Work?</h3>
Well, I can sort of reverse engineer here from different solutions, or read through open source code bases, but instead I'll offer unsupported conjecture and you are free to call me out on it if you know better. Generic profilers rely on the JVMTI spec, which all JVMs must meet:<br />
<ul>
<li>JVMTI offers only safepoint sampling stack trace collection options (GetStackTrace for the calling thread doesn't require a safepoint, but that is not very useful to a profiler. On Zing GetStackTrace to another thread will bring only that thread to a safepoint.). It follows that vendors who want their tools to work on ALL JVMs are limited to safepoint sampling.</li>
<li>You hit a global safepoint whether you are sampling a single thread or all threads (at least on OpenJDK, Zing is slightly different but as a profiler vendor OpenJDK is your assumption.). All profilers I looked into go for sampling all threads. AFAIK they also do not limit the depth of the stack collected. This amounts to the following JVMTI call: <i><span style="font-family: "courier new" , "courier" , monospace;">JvmtiEnv::GetAllStackTraces(0, &stack_info, &thread_count)</span></i> (a rough Java-level analogue is sketched after this list).</li>
<li><span style="font-family: inherit;">So this adds up to: setup a timer thread which triggers at 'sampling_interval' and gathers all stack traces.</span></li>
</ul>
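To get a feel for what 'gather all stack traces at an interval' amounts to, here's a rough Java-level analogue (illustration only - the profilers call JVMTI from native code, not this API, but Thread.getAllStackTraces is likewise a global safepoint operation on OpenJDK):<br />
<pre>
// A naive 'profiler' loop: every interval, collect the stack traces of ALL threads.
// Each collection stops the world, which is exactly the overhead discussed below.
import java.util.Map;

public class NaiveSampler {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            Map allStacks = Thread.getAllStackTraces(); // all threads brought to a safepoint
            System.out.println("sampled " + allStacks.size() + " threads");
            Thread.sleep(100); // the 'sampling_interval'
        }
    }
}
</pre>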
This is bad for several reasons, some of which are avoidable:<br />
<ol>
<li>Sampling profilers need samples, so it is common to set the sampling frequency quite high (usually 10 times a second, or every 100ms). It's instructive to set the -XX:+PrintGCApplicationStoppedTime flag and see what sort of pause time this introduces. It's not unusual to see a few milliseconds pause, but YMMV (depending on number of threads, stack depth, TTSP etc). A 5ms pause every 100ms will mean a 5% overhead (actual damage is likely worse than that) introduced by your profiler. You can usually control the damage here by setting the interval longer, but this also means you will need a longer profiling period to get a meaningful sample count.</li>
<li>Gathering <b>full</b> stack traces from <b>all</b> the threads means your safepoint operation cost is open ended. The more threads your application has (think application server, SEDA architecture, lots of thread pools...), and the deeper your stack traces (think Spring and Co.) the longer your application will wait for a single thread to go round taking names and filling forms. This was clearly demonstrated in the previous post. AFAIK, current profilers do nothing to help you here. If you are building your own profiler it would seem sensible to set a limit on either quantity so that you can box your overheads. The JVMTI functionality allows you to query the list of current threads, you could sample all if there are fewer than 100 and otherwise pick a random subset of 100 to sample. It would make sense to perhaps bias towards sampling threads that are actually doing something as opposed to threads which spend all their time blocked.</li>
<li>As if all that was not bad enough, sampling at a safepoint is a tad meaningless.</li>
</ol>
Points 1 and 2 are about profiling overheads, which is basically about cost. In my previous post on safepoints I looked at these costs, so there's no point repeating the exercise. Cost may be acceptable for good profiling information, but as we'll see the information is not that great.<br />
Point 3 bears explaining, so off we go in the search for meaning.<br />
<br />
<h3>
Safepoint Sampling: Theory</h3>
<div>
So what does sampling at a safepoint mean? It means only the safepoint polls in the running code are visible. Given hot code is likely compiled by C1/C2 (client/server compilers) we have reduced our sampling opportunities to method exit and uncounted loop backedges. This leads to the phenomena called <b>safepoint bias</b> whereby the sampling profiler samples are biased towards the next available safepoint poll location (this fails the second criteria set out above "<b>the profiler should sample all points in a program run with equal probability</b>").<br />
This may not sound so bad at first, so lets work through a simple example and see which line gets the blame.<br />
<b>NOTE</b>: In all of the following examples I will be using <a href="http://psy-lob-saw.blogspot.co.za/p/jmh-related-posts.html" target="_blank">JMH</a> as the test harness and make use of the 'CompilerControl' annotation to prevent inlining. This will let me control the compilation unit limits, and may seem cruel and unusual, or at least unfair, of me. Inlining decisions in the 'wild' are governed by many factors and it is safe (in my opinion) to consider them arbitrary (in the hands of several compilers/JVM vendors/command line arguments etc.). Inlining may well be the "mother of all optimizations", but it is a fickle and wily mother at that.<br />
Let's look at something simple:<br />
<script src="https://gist.github.com/nitsanw/d40a165cb1e6f02a30ea.js"></script><br />
This is an easy example to think about. We can control the amount of work in the method by changing the size of the array. We know the counted loop has no safepoint poll in it (verified by looking at the assembly output), so in theory the methods above will have a safepoint at method exit. The thing is, if we let the method above get inlined the end of method safepoint poll will disappear, and the next poll will be in the measurement loop:<br />
<script src="https://gist.github.com/nitsanw/3475ff92ef355de4f3dc.js"></script><br />
So, it would seem reasonable to expect the method to get blamed if it is not inlined, but if it does get inlined we can expect the measurement method to get blamed. Right? Very reasonable, but a bit off.<br />
<br />
<h3>
Safepoint Sampling: Reality</h3>
<div>
Safepoint bias is discussed in this nice paper from 2010: <a href="http://plv.colorado.edu/papers/mytkowicz-pldi10.pdf" target="_blank">Evaluating the Accuracy of Java Profilers</a>, where the writers recognize that different Java profilers identify different hotspots in the same benchmarks and go digging for reasons. What they don't do is set up some benchmarks where the hotspot is known and use those to understand what it is that safepoint biased profilers see. They state:</div>
<blockquote class="tr_bq">
"If we knew the “correct” profile for a program run, we could evaluate
the profiler with respect to this correct profile. Unfortunately,
there is no “correct” profile most of the time and thus we cannot
definitively determine if a profiler is producing correct results."</blockquote>
<div>
So, if we construct a known workload... what do these profilers see?</div>
<div>
We'll study that with the JMH safepoint biased profiler "-prof stack". It is much like JVisualVM in the profiles it presents for the same code, and a damn sight more convenient for this study.<br />
<i>NOTE: In the following sections I'm using the term sub-method to describe a method which is being called into from another method. E.g. if method A calls method B, B is a sub method of A. Maybe a better terminology exists, but that's what I mean here.</i></div>
<div>
If we run the samples above we get 2 different hot lines of code (run with '-prof stack:detailLine=true'):<br />
<span style="font-family: "courier new" , "courier" , monospace;"># Benchmark: safepoint.profiling.SafepointProfiling.meSoHotInline</span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">....[Thread state: RUNNABLE]...</span></div>
</div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 99.6% 99.8% meSoHotInline_avgt_jmhStub:165</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"># Benchmark: safepoint.profiling.SafepointProfiling.meSoHotNoInline</span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">....[Thread state: RUNNABLE]...</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> 99.4% 99.6% meSoHotNoInline_avgt_jmhStub:163</span></div>
</div>
<div>
<br /></div>
<div>
Neither one is in the actually hot method. It seems that the <u>method exit safepoint is not deemed indicative of its own method, but rather of the line of code from which it was called</u>. So forcing the method under measurement to not inline implicated the calling line of code in the measurement loop, while leaving it to inline meant the back edge of the loop got the blame. It also appears that an uncounted loop safepoint poll is deemed indicative of its own method.</div>
<div>
We may deduce (but we won't necessarily be correct) that when looking at this kind of profile without line of code data a hot method is indicative of:</div>
<div>
<ol>
<li>Some non-inlined sub method is hot*</li>
<li>Some code (own method? inlined sub method? non-inlined sub method?) in an uncounted loop is hot*</li>
</ol>
Having line of code data can help disambiguate the above cases, but it's not really very useful as line of code data. A hot line of code would be indicative of:</div>
<div>
<ol>
<li>Line has a method call: A method called from this line is hot (or maybe it's inlined sub-methods are)*</li>
<li>Line is a loop back edge: Some code (include inlined submethods) in this loop is hot*</li>
</ol>
<i><span style="font-size: x-small;"> *Does that seem useful? Don't get your hopes up.</span></i><br />
Because we usually have no idea which methods got inlined this can be a bit confusing (you can use -XX:+PrintInlining if you want to find out, but be aware that inlining decisions can change from run to run).<br />
<br /></div>
<h3>
Mind The Gap</h3>
If the above rules held true you could use a safepoint biased profile by examining the code under the blamed node in the execution tree. In other words, it would mean a hot method indicates the hot code lies somewhere in that code or in the method it calls. This would be good to know, and the profile could serve as a starting point for some productive digging. But sadly, these rules do not always hold true. They are missing the fact that the hot code can be anywhere between the indicated safepoint poll and the previous one. Consider the following example:<br />
<script src="https://gist.github.com/nitsanw/919dca8636e32d448916.js"></script><br />
Clearly, the time is spent in the loop before calling setResult, but the profile blames setResult. There's nothing wrong with setResult except that a method it calls into is not inlined, providing our profiler with a blame opportunity. This demonstrates the randomness with which the safepoint poll opportunities present themselves to user code, and shows that the hot code may be anywhere between the current safepoint poll and the previous one. This means that a hot method/line-of-code in a safepoint biased profile are potentially misleading without some idea of where the previous safepoint poll is. Consider the following example:<br />
<script src="https://gist.github.com/nitsanw/549b68bc42f13ea88588.js"></script><br />
The profiler implicates the caller to a cheap method 9 levels down the stack, but the real culprit is the loop at the topmost method. Note that inlining will prevent a method from showing, but not-inlined frames only break the gap between safepoints on return (on OpenJDK anyhow. This is vendor and whim dependent. e.g. Zing puts the method level safepoint on entry for instance and I'm not sure where J9 stands on this issue. This is not to say that one way is somehow better than the other, just that the location is arbitrary). This is why <i>setResult6</i> which is not inlined and higher up the stack doesn't show up.<br />
<br />
<h3>
Summary: What Is It Good For?</h3>
<div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgum5Ks0rMO-jYJTTWltyEEyaUr7r2Vjs9dsJsuvU-_cgbunnhkRJVXBMY-gbtpA8jiw1PVW834bjariDh6kyOkDhLLJn5yD4lzMqy5ZoLP7vbo0SEsogAFlFv-4L2YcOu5wgc4v-uDrU/s1600/PS_0825W_ELEMENT_CONFUSION.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgum5Ks0rMO-jYJTTWltyEEyaUr7r2Vjs9dsJsuvU-_cgbunnhkRJVXBMY-gbtpA8jiw1PVW834bjariDh6kyOkDhLLJn5yD4lzMqy5ZoLP7vbo0SEsogAFlFv-4L2YcOu5wgc4v-uDrU/s320/PS_0825W_ELEMENT_CONFUSION.jpg" width="320" /></a>As demonstrated above, a safepoint sampling profiler can have a wildly inaccurate idea of where the hot code in your application is. This renders derived observations on "Running" threads pretty suspect, but they are at least correct observations on which threads are running. This doesn't mean they are entirely useless and sometimes all we need is a hint in the right direction for some good analysis to happen, but there's a potential for huge waste of time here. While samples of code running in the interpreter do not suffer safepoint bias, this is not very useful as hot code is quickly compiled. If your hot code is running in the interpreter you have bigger fish to fry than safepoint bias...</div>
<div>
The stack traces for blocked threads are accurate, and so the "Waiting" profile is useful to discover the blocked code profile. If blocking methods are at the root of your performance issue then this will be a handy observation.</div>
<div>
There are better options out there! I'll get into some of them in following posts:</div>
<div>
- Java Mission Control</div>
<div>
- Solaris Studio</div>
<div>
- Honest-Profiler</div>
<div>
- Perf + perf-map-agent (or <a href="http://psy-lob-saw.blogspot.com/2015/07/jmh-perfasm.html" target="_blank">perfasm</a> if your workload is wrapped in a JMH benchmark)</div>
No tool is perfect, but all of the above are a damn sight better at identifying where CPU time is being spent.<br />
Big thanks to the many reviewers, if you find any issues please comment and I will punish them severely.<br />
<br />
<b>UPDATE: Read follow up post of Honest-Profiler/Java Flight Recorder type profilers in this next post - <a href="http://psy-lob-saw.blogspot.co.za/2016/06/the-pros-and-cons-of-agct.html" target="_blank">The Pros And Cons of AsyncGetCallTrace Profilers</a></b></div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com18tag:blogger.com,1999:blog-5171098727364395242.post-5164247164152020742016-02-01T07:56:00.000+00:002016-02-01T07:56:56.144+00:00Wait For It: Counted/Uncounted loops, Safepoints and OSR CompilationIn this <a href="http://psy-lob-saw.blogspot.com/2015/12/safepoints.html" target="_blank">previous post about Safepoints</a> I claimed that this here piece of code:<br />
<script src="https://gist.github.com/nitsanw/173f896c1549c07f5fe6.js"></script><br />
Will get your JVM to hang and will not exit after 5 seconds, as one might expect from reading the code. The reason this happens, I claimed, is because the compiler considers for loops limited by an int to be <b>counted</b>, and will therefore not insert a <b>safepoint poll</b> into the generated assembly code. This was the theory, and I tested it, and it definitely hung on my machine ('It Works On My Machine'™).<br />
To my great surprise, some people not only read the posts, they also run the code (it's like they don't trust me ;-) ). And 2 of those diligent readers got back to me in the comments saying that actually the JVM exits just fine. So I ran it again, and it hung, again. Then I ran it with different versions of the JVM, and it hung as I expected. Then I thought, maybe they are running the code I put on GitHub, so I tried that. That also hung when I ran it from Eclipse, but did not when I built it with Maven and ran it from the command line. EUREKA!<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhB-_FWpNiVJLZvZzNTd0S1DLTlExpe_fV6wMXAkVaBoVOg29Ab9Ss_91qP0I8R0IYnmtLH_59fCav-2VmkWxc1v6-lXHDrFUqWI4Lt7ZH9114n0GjP7c2RImKYCdNwSBGsK1NzNFfAFog/s1600/jabberwocky.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhB-_FWpNiVJLZvZzNTd0S1DLTlExpe_fV6wMXAkVaBoVOg29Ab9Ss_91qP0I8R0IYnmtLH_59fCav-2VmkWxc1v6-lXHDrFUqWI4Lt7ZH9114n0GjP7c2RImKYCdNwSBGsK1NzNFfAFog/s320/jabberwocky.jpg" width="215" /></a></div>
As it happens, Eclipse has its own Java compiler (a fact well known to those who know it well, no doubt) and the bytecode it generated (the .class file) hung every time, but Maven uses javac and the .class file generated from that did not hang. When testing initially I was running my class inside Eclipse, or using the Eclipse .class file to run from the command line. Both .class files are valid, but different (same-same but different). We can use javap to print out class bytecode and compare ("javap -c -v -p <class file>"). For your pleasure you can find the original Java file, the 2 class files and their respective bytecode printouts on GitHub and come to your own conclusions before you continue reading.<br />
So... is my understanding (and the Eclipse compiler) somehow out of date? Can this problem of telling counted loops from uncounted loops be solved by the bytecode compiler?<br />
<br />
<h3>
How is the bytecode different?</h3>
Well, mostly not that different:<br />
<ol>
<li>An easy to spot difference, which we can tell is semantically the same, is the bytecode generated for "System.out.println("How Odd:" + l);". You'll notice that Eclipse is using the constructor "java/lang/StringBuilder."<init>":(Ljava/lang/String;)V" while javac is opting for "java/lang/StringBuilder."<init>":()V" followed by an append (see the sketch after this list). Semantically the 2 choices are the same: we create a StringBuilder and append the String "How Odd:" and the variable <b style="font-style: italic;">l</b>. The reality is that there are subtle differences, the Eclipse version will allocate a larger <i>char[]</i> for instance. It may also be that the Eclipse pattern will miss out on some String concatenation optimization, because the OpenJDK implementers are likely to focus on patterns generated by the official javac.</li>
<li>The loop interpretation is different. Eclipse is checking the loop conditions at the end of the loop whereas javac is building a loop with the checks upfront and the loop backedge at the end. That seems like the important difference.</li>
</ol>
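To see the first difference in Java terms, the two bytecode shapes correspond roughly to the following (a sketch, not the actual generated code):<br />
<pre>
// Roughly what the two compilers emit for "How Odd:" + l (a sketch, not the real bytecode)
static String eclipseStyle(long l) {
    return new StringBuilder("How Odd:").append(l).toString(); // builder seeded with the constant
}

static String javacStyle(long l) {
    return new StringBuilder().append("How Odd:").append(l).toString(); // empty builder, two appends
}
</pre>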
Let's zoom in on the relevant bytecode for the loop:<br />
<script src="https://gist.github.com/nitsanw/1e8e8bab731cf79b18c3.js"></script><br />
If we were to reverse translate the loop bytecode to ugly Java (which supported goto) we'd get:<br />
<div>
<script src="https://gist.github.com/nitsanw/3df7909578c24e10fd2d.js"></script></div>
<br />
Note that there's no special Java bytecode for loop control flows. So why so different at runtime? If we dump out the assembly for the run method we will see that the Eclipse version is missing the safepoint in the loop (the method has a <i style="font-weight: bold;">{poll_return}</i> but there's no <b><i>{poll}</i></b>), and the javac version has that safepoint there. Is it the bytecode difference that makes this happen?<br />
<br />
<h3>
Variations on bytecode</h3>
<div>
I tinkered with the class a bit and refactored the loop out of the lambda. This resulted in similar bytecode for the loop and the same behaviour. I then made the limit of the loop a parameter of the method; this time it behaved like the javac version, but had the same funky loop structure:<br />
<script src="https://gist.github.com/nitsanw/57f136c1b85436394f8a.js"></script><br />
WTF? If this is purely a bytecode subtlety then it seems pretty weird. What sort of flaky compiler would swing like this?<br />
<br /></div>
<h3>
The compiler did it?</h3>
<div>
But... which compiler? So far we have looked at the bytecode compiler results, but that is only the first of many compilers to process your code on its way to being executed. Traditionally when discussing Java code compilation we look at the following stages:<br />
<ol>
<li>Javac: From code to bytecode</li>
<li>Interpreter: Step through the bytecode</li>
<li>C1/Client: Short compilation time, does not generate the best optimized code. Intended for short lived applications.</li>
<li>C2/Server: Longer compilation time, better optimized code. Intended for long running applications.</li>
</ol>
And the compilation timeline would look something like this:<br />
<span style="font-family: "courier new" , "courier" , monospace;"> /->C1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">*.java->javac->*.class->interpreter->class+profile<</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> \->C2</span><br />
<span style="font-family: inherit;">The situation with OpenJDK8 is more complex with <a href="http://www.slideshare.net/maddocig/tiered" target="_blank">Tiered Compilation</a> allowing us a smooth transition from interpreter to C2 compilation via C1 compilations. </span><br />
<span style="font-family: inherit;">Given the behaviour we are after was supposed to manifest after a C1/C2 compilation we can run the code above to find which compiler compiled the method in question by adding the flag: -XX:+PrintCompilation:</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> 159 201 % 3 safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> 159 202 3 safepoint.hang.WhenWillItExitInt2::countOdds (44 bytes)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> 159 203 % 4 safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> 162 201 % 3 safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes) made not entrant</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> 142 200 % 3 safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes) made not entrant</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> 1896 202 % 4 safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes) made not entrant</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> 1896 203 % 3 safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> 1896 204 % 4 safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> 1899 203 % 3 safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes) made not entrant</span><br />
<span style="font-family: inherit;">The 3/4 printouts on the 4th column above indicate the method is compiled by C1 (profiled) and C2. But what is that <b>% </b>about? This is an indicator that the compilations are not 'normal' compilations but rather On Stack Replacement(OSR) compilations. OSR compilations are 'special', from the OpenJDK glossary:</span><br />
<blockquote class="tr_bq">
<b>on-stack replacement</b> <br />
Also known as 'OSR'. The process of converting an interpreted (or less optimized) stack frame into a compiled (or more optimized) stack frame. This happens when the interpreter discovers that a method is looping, requests the compiler to generate a special nmethod with an entry point somewhere in the loop (specifically, at a backward branch), and transfers control to that nmethod. A rough inverse to deoptimization.</blockquote>
I like the "rough inverse to deoptimization", it's a great mental qualification helper. OSR is the opposite of deoptimization because it is a mid-method replacement of code with compiled code, whereas a deoptimization is a mid-method replacement of code with interpreted code. Normal compilation unit replacements happen out of method, so while the method is off stack. OSR is specifically done to replace long running loops, like our countOdds method. As we can see from the printout above both C1 and C2 can produce OSR compilations.<br />
It is impossible to disable OSR directly in a product build (there's a developer option -XX:-CICompileOSR), but we can force upfront compilation by using the following flags: "-XX:-TieredCompilation -server/client -Xcomp". Running with these arguments all 3 variations (original, refactored loop, refactored parametrized loop) now hang both for the Eclipse and the javac generated bytecode.<br />
In both cases where the Eclipse versions hang we can see OSR compilations of the relevant code.<br />
So it seems:<br />
<ul>
<li>C1/C2 non-OSR compilations of the loop treat it as a counted loop, thus eliminating the safepoint poll and demonstrating the hung behaviour due to huge TTSP.</li>
<li>C1/C2 OSR compilations of the same code result in different results depending on the generated bytecode for the loop. I'm not going to dig further into the reasons, but I would welcome any feedback from OpenJDK compiler guys.</li>
</ul>
The OSR compilation decision to treat the loop as non-counted is quite reasonable. If we have a loop which remains on stack for very long periods we should perhaps consider it uncounted.<br />
<br />
<h3>
Killer Task</h3>
</div>
<div>
I started down this path looking for valid Java code which will result in a hung JVM due to huge TTSP. Since OSR got in the way, can we build such a task? Sure we can, we just need to warm up the loop until it is C2 compiled and then run it with a large limit, like so:</div>
<div>
<script src="https://gist.github.com/nitsanw/a8b20fc6ecbb9ae9c054.js"></script></div>
<div>
This fucker now hangs in both the Eclipse and javac versions. Joy!<br />
<br />
<h3>
Side note: OSR and JMH</h3>
</div>
<div>
I look at benchmarks people write quite a bit, and it is very common to see people who roll their own benchmark harnesses put their "code under measurement" in a counted loop, run it 100,000 times and derive their results. Often that means the code they are measuring is OSR compiled code, which is different (for better or worse, mostly worse) from the code they will be running normally. If you are one of these people, I beg you: cut that fucking shit out!</div>
<div>
JUST USE JMH!<br />
PLEASE?</div>
<div>
To a large extent it will cover your use case. Where it doesn't cover your use case, post a question on the jmh-dev mailing list; it might be that you've missed something. If it turns out JMH cannot cover your use case, make sure you've read the JMH samples before rolling your own harness.<br />
I'm not going to include a before and after example, because it just makes me sad. Look at SO for some fine examples.<br />
<br /></div>
<h3>
Counted Loops?</h3>
I keep saying 'counted loops' like it means something, so here are some examples for you to play with and <a href="https://www.youtube.com/watch?v=yy5THitqPBw" target="_blank">contemplate</a>:<br />
<script src="https://gist.github.com/nitsanw/c08ed6e17b667c383b0b.js"></script><br />
So, as you can see the definition ain't so clear and I would imagine some of the above behaviour will vary depending on JVM version/vendor/phase of the moon etc.<br />
<h3>
Summary</h3>
<div>
So, it seems my previous example was not as tested as it should have been. I apologize to anyone who has based life changing decisions on that sample. I hope this post helps in explaining away the subtleties which led to my downfall.</div>
<div>
The above examples help demonstrate that the line between counted and uncounted loops is a tad confusing and the classification may depend on differences in bytecode, compiler mode, and so on.</div>
<div>
<br /></div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com10tag:blogger.com,1999:blog-5171098727364395242.post-8311682357667705072015-12-14T11:49:00.001+00:002016-02-01T08:09:27.024+00:00Safepoints: Meaning, Side Effects and OverheadsI've been giving a couple of talks in the last year about profiling and about the JVM runtime/execution and in both I found myself coming up against the topic of Safepoints. Most people are blissfully ignorant of the existence of safepoints, and out of a room full of people I would typically find 1 or 2 developers who would have any familiarity with the term. This is not surprising, since safepoints are not a part of the Java language spec. But as safepoints are a part of every JVM implementation (that I know of) and play an important role, here's my feeble attempt at an overview.<br />
<br />
<h3>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-z3CCIhVG5WfGtsROF0nxKA85-aoa_pVtBBErX3kijvsGNJSCQUxfUzNaCU712j-CqfpaLYnNHUL9Hp_hNudvyQKGU-VGG_BxgPXorOkdoqKrsiT9JrW2H-KJD2gqyvwmtjP3F8xnNIE/s1600/safepineapples.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-z3CCIhVG5WfGtsROF0nxKA85-aoa_pVtBBErX3kijvsGNJSCQUxfUzNaCU712j-CqfpaLYnNHUL9Hp_hNudvyQKGU-VGG_BxgPXorOkdoqKrsiT9JrW2H-KJD2gqyvwmtjP3F8xnNIE/s200/safepineapples.jpg" width="200" /></a>
What's a Safepoint?</h3>
<div>
I was recently asked: "Are safepoints like safe words?". The short answer is "not really", but since I like the metaphor I'll run with it:<br />
Imagine if you will a JVM full of mutator threads, all busy, sweating, mutating the heap. Some of them have <gasp> shared mutable state. They're mutating each others state, concurrently, like animals. Some stand in corners mutating their own state (go blind they will). Suddenly a neon sign flashes the word PINEAPPLES. One by one the mutators stop their rampant heap romping and wait, sweat dripping. When the last mutator stops, a bunch of elves come in, empty the ashtrays, fill up all the drinks, mop up the puddles, and quickly as they can they vanish back to the north pole. The sign is turned off and the threads go back to it....<br />
<i><span style="color: orange;"><inner-dialogue>Too much? It's like 35c in the shade here, so sweating is like... no? never mind </inner-dialogue></span></i><br />
There are many references to Safepoints to be found online and what follows here is my attempt at a more nuanced use/definition, no sweating mutators from this point onwards.<br />
<blockquote class="tr_bq">
A <b>safepoint</b> is a range of execution where the state of the executing thread is well described. <b>Mutator</b> threads are threads which manipulate the JVM heap (all your Java Threads are <b>mutators</b>. Non-Java threads may also be regarded as <b>mutators </b>when they call into JVM APIs which interact with the heap).<br />
<b>At a safepoint</b> the <b>mutator</b> thread is at a known and well defined point in its interaction with the heap. This means that all the references on the stack are mapped (at known locations) and the JVM can account for all of them. As long as the thread remains <b>at a safepoint</b> we can safely manipulate the heap + stack such that the thread's view of the world remains consistent when it leaves the safepoint.</blockquote>
This is particularly useful when the JVM wishes to examine or change the heap, for a GC or any one of many other reasons. If references on the stack were unaccounted for and the JVM was to run a GC it may miss the fact that some objects are still alive (referenced from the stack) and collect them, or it may move some object without updating its reference on the stack, leading to memory corruption.</div>
<div>
A JVM will therefore need a means of bringing threads to safepoints (and keeping them there) so that all sorts of runtime magic can happen. Here's a partial list of activities which JVMs run only once all mutator threads are at a safepoint <u>and cannot leave it until released</u> (at a <b>global safepoint</b>), these are sometimes called <b>safepoint operations</b>:</div>
<div>
<ul>
<li>Some GC phases (the Stop The World kind)</li>
<li>JVMTI stack sampling methods (not always a global safepoint operation for Zing)</li>
<li>Class redefinition</li>
<li>Heap dumping</li>
<li>Monitor deflation (not a global safepoint operation for Zing)</li>
<li>Lock unbiasing</li>
<li>Method deoptimization (not always)</li>
<li>And many more!</li>
</ul>
</div>
<div>
A <a href="https://www.parleys.com/tutorial/with-gc-solved-what-else-makes-jvm-pause" target="_blank">great talk by Azul's own John Cuthbertson from JavaOne 2014</a> goes into some background on Safepoints and allot of details about safepoint operations other than GC (we at Azul feel GC is a solved problem, so the talk goes into the remaining reasons to stop).<br />
Note that the distinction between requesting a <b>global safepoint </b>and a <b>thread safepoint</b> exists only for some JVM implementations (e.g. <b>Zing</b>, the Azul Systems JVM. <u>Reminder: I work for Azul</u>). Importantly it does <b>not </b>exist on OpenJDK/Oracle JVMs. This means that Zing can bring a single thread to a safepoint.</div>
<div>
To summarize:</div>
<div>
<ul>
<li>Safepoints are a common JVM implementation detail</li>
<li>They are used to put mutator threads on hold while the JVM 'fixes stuff up'</li>
<li>On OpenJDK/Oracle every safepoint operation requires a <b>global safepoint</b></li>
<li>All current JVMs have some requirement for <b>global safepoints</b></li>
</ul>
</div>
<h3>
When is my thread at a safepoint?<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDQqpdgpoZ-bijVpQ9GJsvhN3HZEfDMcdM-dKSqjCdwSjQFqVR-wDN8ClnVCnqTBlOTyki4LZABlxhcvzsYO8AENyQEjqpUYPctCO-BW70cXqiPtJc4CzjHWW9M3PPNkl2e8KOCMDFAQ4/s1600/By+Toutatis.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="153" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDQqpdgpoZ-bijVpQ9GJsvhN3HZEfDMcdM-dKSqjCdwSjQFqVR-wDN8ClnVCnqTBlOTyki4LZABlxhcvzsYO8AENyQEjqpUYPctCO-BW70cXqiPtJc4CzjHWW9M3PPNkl2e8KOCMDFAQ4/s320/By+Toutatis.jpg" width="320" /></a></div>
</h3>
<div>
So having threads at a safepoint allows the JVM to get on with its managed runtime magic show, great! When is this groovy state happening?</div>
<div>
<ul>
<li>A Java thread is <b>at a safepoint</b> if it is blocked on a lock or synchronized block, waiting on a monitor, parked, or blocked on blocking IO. Essentially these all qualify as orderly de-scheduling events for the Java thread and as part of tidying up before being put on hold the thread is brought to a safepoint.</li>
<li>A Java thread is <b>at a safepoint</b> while executing JNI code. The stack is left in a consistent state before crossing the native call boundary and handing off to the native code. This means that the thread can still run while <b>at a safepoint</b>.</li>
<li>A Java thread which is executing bytecode <b>is NOT at a safepoint </b>(or at least the JVM cannot assume that it is at a safepoint).</li>
<li>A Java thread which is interrupted (by the OS) while not at a safepoint is not brought to a safepoint before being de-scheduled.</li>
</ul>
The JVM and your running Java threads have the following relationship around safepoints:</div>
<div>
<ul>
<li>The JVM cannot force any thread into a safepoint state.</li>
<li>The JVM can stop threads from leaving a safepoint state.</li>
</ul>
So how can the JVM bring all threads into a safepoint state? The problem is suspending a thread at a <u>known state</u>, not just interrupting it. To achieve this goal JVMs have the Java threads suspend themselves at convenient spots if they observe a 'safepoint flag'.<br />
<br /></div>
<h3>
Bringing a Java Thread to a Safepoint</h3>
Java threads poll a 'safepoint flag' (global or thread level) at 'reasonable' intervals and transition into a safepoint state (thread is blocked at a safepoint) when they observe a 'Go to safepoint' flag. This is simple enough, but since we don't want to spend all of our time checking if we need to stop, the C1/C2 compilers (-client/-server JIT compilers) try to keep safepoint polls to a minimum. On top of the cost of the flag check itself, maintaining a 'known state' adds significant complexity to the implementation of certain optimizations and so keeping safepoints further apart opens up a wider scope for optimization. These considerations combined lead to the following locations for <b>safepoint polls:</b><br />
<ul>
<li>Between any 2 bytecodes while running in the interpreter (effectively)</li>
<li>On 'non-counted' loop back edge in C1/C2 compiled code</li>
<li>Method entry/exit (entry for Zing, exit for OpenJDK) in C1/C2 compiled code. Note that the compiler will remove these safepoint polls when methods are inlined.</li>
</ul>
If you are the sort of person who looks at assembly for fun (or profit, or both) you'll find safepoint polls in the -XX:+PrintAssembly output by looking for:<br />
<ul>
<li><b>'{poll}'</b> or '<b>{poll_return}'</b> on OpenJDK, these will be in the instruction comments</li>
<li>'<b>tls.pls_self_suspend</b>' on Zing, this will be the flag examined at the poll operation</li>
</ul>
This mechanism is implemented differently on different VMs (on x86 + Linux, I've not looked at other architectures):<br />
<ul>
<li>On Oracle/OpenJDK a blind TEST of an address on a special memory page is issued. Blind because it has no branch following it so it acts as a very unobtrusive instruction (usually a TEST is immediately followed by a branch instruction). When the JVM wishes to bring threads to a safepoint it protects the page causing a SEGV which is caught and handled appropriately by the JVM. There is one such special page per JVM and therefore to bring any thread to a safepoint we must bring all threads to a safepoint.</li>
<li>On Zing the safepoint flag is thread local (hence the tls prefix). Threads can be brought to a safepoint independently.</li>
</ul>
See related post: <a href="https://psy-lob-saw.blogspot.com/2014/03/where-is-my-safepoint.html" target="_blank">Dude, Where's My Safepoint?</a> for more details on polls.<br />
Once a thread detects the safepoint flag it will perform the safepoint action that the poll triggers. This normally means blocking on some sort of JVM level lock, to be released when the safepoint operation is over. Think of it as a locking mechanism where:<br />
<br />
<ul>
<li>Threads can lock themselves out (e.g. by calling into JNI, or blocking at a safepoint).</li>
<li>Threads can try and re-enter (when returning from JNI), but if the lock is held by the JVM they will block.</li>
<li>The <b>safepoint operation </b>requests the lock and blocks until it owns it (when all the mutators have locked themselves out)</li>
</ul>
<br />
<h3>
All Together Now</h3>
Now we have all the moving parts of the safepoint process:<br />
<ul>
<li>Safepoint triggers/operations: reasons to request, and stuff to do which requires one or all threads to be at a safepoint</li>
<li>Safepoint state/lock: into which Java threads can voluntarily check in, but can only leave when the JVM is done executing the safepoint operation</li>
<li>Safepoint poll: the voluntary act by a Java thread to go into a safepoint if one is needed</li>
</ul>
Let's see how it all hangs together, for example when a profiler is sampling the stack using JVMTI::GetStackTrace:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPUB1OvwpKSyn-263emaV1gCMIYoKTpu2mzkXGUxnMRkjyEYyEeHLSnRWqFIoJ1I1KkGo_hs09HycuC2gHNyFNbdBZamQtU41YEsDqa89x4I5x1yznWQ-4ktQgP7-BAqm7ejuxGYw6erk/s1600/SafepointOverheads.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPUB1OvwpKSyn-263emaV1gCMIYoKTpu2mzkXGUxnMRkjyEYyEeHLSnRWqFIoJ1I1KkGo_hs09HycuC2gHNyFNbdBZamQtU41YEsDqa89x4I5x1yznWQ-4ktQgP7-BAqm7ejuxGYw6erk/s640/SafepointOverheads.png" width="640" /></a></div>
<br />
<span style="font-family: "courier new" , "courier" , monospace;"><b><span style="background-color: #6aa84f;">GREEN </span>: Executing Java</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b><span style="background-color: #ffd966;">YELLOW </span>: Descheduled Java</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b><span style="background-color: #cc0000;">RED </span>: Native</b></span><br />
This diagram highlights several components of safepoint operation overheads:<br />
<ol>
<li>Time To Safe Point(TTSP): Each thread enters a safepoint when it hits a safepoint poll. But arriving at a safepoint poll requires executing an unknown number of instructions. We can see J1 hits a safepoint poll straight away and is suspended. J2 and J3 are contending on the availability of CPU time. J3 grabs some CPU time pushing J2 into the run queue, but J2 is not in a safepoint. J3 arrives <b>at a safepoint</b> and suspends, freeing up the core for J2 to make enough progress to get to a safepoint poll. J4 and J5 are already at a safepoint while executing JNI code, they are not affected. Note that J5 is trying to leave JNI halfway through the safepoint and is suspended before resuming Java code. Importantly we observe that the time to safepoint varies from thread to thread and some threads are paused for longer than others, Java threads which take a long time to get to a safepoint can delay many other threads.</li>
<li>Cost of the <b>safepoint operation</b>: This will depend on the operation. For GetStackTrace the cost of the operation itself will depend on the depth of the stack for instance. If your profiler is sampling many threads (or all the threads) in one go (e.g. via JVMTI::GetAllStackTraces) the duration will grow as more threads are involved. A secondary effect may be that the JVM will take the opportunity to execute some other safepoint operations while the going is good. </li>
<li>Cost of resuming the threads.</li>
</ol>
Some takeaways from this analysis:<br />
<ol>
<li>Long TTSP can dominate pause times: some causes for long TTSP include page faults, CPU over-subscription, long running counted loops.</li>
<li>The cost of stopping all threads and starting all threads scales with the number of threads. The more threads the higher the cost. Assuming some non-zero cost to suspend/resume and TTSP, multiply by number of threads...</li>
<li>Add -XX:+PrintGCApplicationStoppedTime to your list of favourite flags! It prints the pause time and the TTSP.</li>
</ol>
But why should you trust me? Let's break some shit!!! Please note in the following examples I will be switching between Oracle 8u60 and Zing (latest dev JDK8 build). I work for <a href="https://www.azul.com/products/zing/" target="_blank">Azul Systems on the Zing JVM</a>, so feel free to assume bias on my part. My main motivation in including both JVMs is so that I can demonstrate the effect of -XX:+KeepSafepointsInCountedLoops on Zing.<br />
NOTE: The benchmarks are not intended to spark a flame war and though Zing happens to compile this particular piece of code 'better' I would not get over excited about it. What is exciting is the extra option Zing makes available here to control and minimize TTSP, and the additional perspective it offers on how a JVM can deal with these challenges.<br />
<br />
<h3>
Example 0: Long TTSP Hangs Application</h3>
<div>
<script src="https://gist.github.com/nitsanw/69b0e08be90422248e52.js"></script></div>
<div>
The above code should exit in 5 seconds (or <a href="https://github.com/nitsanw/safepoint-experiments/blob/master/src/main/java/safepoint/hang/WhenWillItExitInt.java" target="_blank">see it on GitHub</a>). The silly computation running in the daemon thread should not stop it from exiting (the JVM should exit if all remaining running Java threads are daemon threads). But if we run this sample we will find that the JVM does not exit after 5 seconds. It doesn't exit at all <b>[UPDATE: this example turned out to be a bit flaky as per discussion in comments and the following <a href="http://psy-lob-saw.blogspot.co.za/2016/02/wait-for-it-counteduncounted-loops.html" target="_blank">blog post</a>]</b>. Not only does it not exit, it will also refuse incoming JMX connections, and it won't respond to jmap/jstack. It won't naturally die, you'll have to 'kill -9' this fucker.<br />
The reason it's so stuck is because the computation is running in a 'counted' loop, and counted loops are considered short by the JVM. Short enough to not have safepoints. This thread refusing to come to a safepoint will lead to all other threads suspending indefinitely at the next safepoint operation.<br />
We can fix this (somewhat contrived) problem in several ways:<br />
<ol>
<li>Prevent C1/C2 compilation: Use -Xint or a compile command to prevent the method compilation. This will force safepoint polls everywhere and the thread will stop hurting the JVM.</li>
<li>Replace the index type with long: this will make the JVM consider this loop as 'uncounted' and include a safepoint poll in each iteration (see the sketch after this list).</li>
<li>On Zing you can apply the -XX:+KeepSafepointsInCountedLoops option on either the JVM or method level (via a -XX:CompileCommand=option,... or via a compiler oracle file). This option is <a href="https://bugs.openjdk.java.net/browse/JDK-6869327" target="_blank">coming shortly to an OpenJDK near you</a>. See also related bug where <a href="https://bugs.openjdk.java.net/browse/JDK-5014723" target="_blank">breaking the loop into an inner/outer loop with safepoints in the outer loop</a> is discussed.</li>
</ol>
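The second fix above, sketched (assuming a countOdds-style loop; a long index makes the loop 'uncounted' so the poll stays):<br />
<pre>
// Sketch: same loop, long index -> 'uncounted', safepoint poll kept on the back edge
static long countOdds(long limit) {
    long count = 0;
    for (long i = 0; i < limit; i++) {
        if ((i & 1) == 1) count++;
    }
    return count;
}
</pre>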
This example is handy when considering shared JVM solutions. This is perfectly 'safe' Java code that will bring most application containers to their knees.<br />
<b>Summary: </b>Delaying a <b>safepoint poll </b>indefinitely can be harmful...</div>
<br />
<h3>
Example 1: More Running Threads -> Longer TTSP, Higher Pause Times</h3>
<div>
<script src="https://gist.github.com/nitsanw/4cdb09cb20dba1f8ed35.js"></script></div>
The benchmark has 1 thread triggering a <b>safepoint operation</b> by calling <i>safepointMethod()</i> (in this case the <i>System.gc() </i>will be called as a result) and many threads scanning a byte array. The byte array scan loop will not have a <b>safepoint poll </b>(it's a counted loop), so the size of the array will act as our intended <b>safepoint poll interval</b>. With this setup we can control the frequency of safepoint operations (with some overhead), avg. distance from safepoint poll and with JMH the number of threads. We can use -XX:+PrintGCApplicationStoppedTime to log the stopped time and TTSP. Armed with this experiment we can now find a large machine to play on and see what happens. I use the command line '-tg' parameter to control the number of threads executing the <i>contains1() </i>method.<br />
So, here goes on a 48 core Haswell Xeon machine, Cent OS, I'll be using Oracle 8u60. The benchmark class I'm running is <a href="https://github.com/nitsanw/safepoint-experiments/blob/master/src/main/java/safepoint/fair/SafepointUsingGc.java" target="_blank">here</a>.<br />
Lets start with no <i>contains1() </i>threads and explain the method and data collection as we go along:<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">$ java -jar safepoints.jar -f 1 -i 10 -wi 10 -jvmArgs="-XX:+PrintGCApplicationStoppedTime -Xms1g -Xmx1g" -tg 0,0,0,1 SafepointUsingGc</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">There's 4 thread groups because I add another benchmark to run with this crowd in the next sample.</span><br />
<span style="font-family: inherit;">This spouts allot of </span>PrintGCApplicationStoppedTime printouts similar to:<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">Total time for which application threads were stopped: 0.0180172 seconds, Stopping threads took: 0.0000513 seconds</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">I'm no Unix shell guru, but with some googling even I can AWK something to summarize this shit:</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">grep 'Total time' output.file | awk '{ stopped += $9;ttsp += $14; n++ } END { if (n > 0) print "stopped.avg=" stopped / n ", ttsp.avg=" ttsp / n; }'</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">It's summing up the warmup period measurements as well, but I'm willing to let it ride. So with that in mind, lets compare the benchmark output with the reported pause times:</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">SafepointUsingGc.run_together:safepoint 100 10 32768 true avgt 10 118611444.689 ± 1288530.618 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">stopped.avg=0.015759, ttsp.avg=6.98995e-05</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Or formatted for humans:</span><br />
<span style="background-color: #eeeeee; font-family: "courier new" , "courier" , monospace; font-size: small;"><b><u>park+gc(ms) |Stp.ms|TTSP.ms|</u></b><br /> 119 ± 1.0 | 16.1 | 0.07 |</span><br />
<div>
<br />
We went with the default intervalMs=100, so we'd expect the <i>safepoint()</i> cost to be around 100 + stopped.avg, it's actually 2ms more than that. This is why we have the <i>park()</i> benchmark to baseline against. Running again with the park group enabled ('-tg 0,0,1,1'):</div>
<div>
<span style="background-color: #eeeeee; font-family: "courier new" , "courier" , monospace; font-size: small;"><b><u>park(ms) | park+gc(ms) |Stp.ms|TTSP.ms|</u></b><br />
101 ± 0.8 | 118 ± 1.4 | 15.9 | 0.07 |
</span></div>
<div>
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">We can see that there's some variance to the GC call and some scheduling variance which can account for delta between our expected value and actual. Small potatoes, lets move on. Start with measuring the cost of the <i>contains1() </i>call when we're not slapping it with a GC every 100ms ('-tg 1,0,1,0' we're running the park baseline):</span></div>
<div>
<span style="background-color: #eeeeee; font-family: "courier new" , "courier" , monospace; font-size: small;">
<b><u>contains1(us) | park(ms) |Stp.ms|TTSP.ms|</u></b><br />
32.2 ± 0.02 | 100.1 ± 0.0 | 0.0 | 0.0 |</span></div>
<div>
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Note the </span> <i>contains1()</i><span style="font-family: inherit;"> method cost is in microseconds. Awesome, now lets ramp up the number of </span><i>contains1() </i>thread with the GC thread running every 100ms:</div>
<div>
<span style="background-color: #f3f3f3; font-family: "courier new" , "courier" , monospace; font-size: small;">
<b><u>T |contains1(us) | park(ms) | park+gc(ms) |Stp.ms|TTSP.ms|</u></b><br />
1 | 51.4 ± 3.1 | 100.7 ± 0.7 | 117.5 ± 0.7 | 15.1 | 0.1 |<br />
2 | 49.5 ± 17.1 | 101.6 ± 1.9 | 116.9 ± 0.5 | 14.4 | 0.1 |<br />
4 | 45.5 ± 2.6 | 102.8 ± 4.0 | 116.9 ± 0.7 | 14.4 | 0.1 |<br />
8 | 45.6 ± 2.1 | 103.6 ± 5.6 | 114.0 ± 0.7 | 13.3 | 0.1 |<br />
16 | 46.8 ± 1.3 | 103.4 ± 6.8 | 115.0 ± 1.5 | 12.8 | 0.2 |<br />
32 | 65.7 ± 2.0 | 101.8 ± 0.9 | 113.0 ± 1.3 | 11.4 | 0.7 |<br />
64 | 121.9 ± 1.9 | 106.0 ± 8.1 | 116.1 ± 1.4 | 14.5 | 5.4 |<br />
128 | 251.9 ± 6.6 | 103.6 ± 5.3 | 119.6 ± 2.3 | 16.9 | 6.4 |<br />
256 | 513.2 ± 13.4 | 109.2 ± 9.7 | 123.4 ± 1.7 | 20.7 | 8.0 |<br />
512 |1111.3 ± 69.7 | 143.9 ±24.8 | 134.8 ±15.3 | 27.8 | 10.0 |</span><br />
<div>
<br /></div>
<div>
OK, that escalated not so quickly. Keep in mind we only actually have 48 cores (hyper-threaded), and so when we cross the line between 32 and 64 we are now looking at the effect of contention and scheduling on the TTSP. It is evident that if you have more CPU bound threads than CPUs performance will degrade, but that is not really what we're after.<br />
At 1 thread we already see a big drop in performance for <i>contains1()</i>. This is somewhat understandable in the presence of 15ms stolen from the application every 100ms. But where you'd expect a 15% difference we're actually looking at a far more significant drop in performance. Going from 32us to 51us kinda sucks.<br />
We can see a steady growth in TTSP while within the core count, going from 0.1ms to 0.7ms. This is bad news for anyone looking to keep their latencies in the low microseconds, as touched upon in <a href="https://www.lmax.com/blog/staff-blogs/2015/08/05/jvm-guaranteed-safepoints/" target="_blank">this post from the brilliant Mark Price</a>. Mark highlights the fact that the JVM regularly calls for a global safepoint (<a href="http://muslimvilla.smfforfree.com/index.php?topic=3219.0;wap2" target="_blank">“whether she needed it or not”</a>), so you may find to your delight that these little hiccups come round even when you've been allocating nothing.<br />
Once we have more threads than cores the TTSP cost goes into the milliseconds, but escalates rather slowly once there. Going from 5ms to 10ms while we go from 64 to 512 threads is better than I expected.<br />
<b>Summary:</b> By increasing the thread count we increase the TTSP. CPU over-subscription will lead to a further increase in TTSP. Monitor your machine's load avg to detect this kind of misconfiguration.</div>
</div>
<h3>
Example 2: Long TTSP has Unfair Impact</h3>
<div>
<script src="https://gist.github.com/nitsanw/f677c6162e662c4525ed.js"></script></div>
<div>
Building on the previous benchmark, but now we have 2 methods: <i>contains1() </i>and<i> </i><i>contains1toK()</i>. <i>contains1toK() </i>is K times further away from a <b>safepoint poll</b> (the outer loop is counted, the containsNeedle method gets inlined, removing the method exit safepoint, and the loop inside it is counted). By increasing K we can see the impact of one thread on the other, in particular we expect the cost of <i>contains1()</i> to increase as the cost of <i>contains1toK()</i> increases, in the presence of <b>safepoint operations</b>. The unfairness here being that while <i>contains1toK()</i> is making progress it is making <i>contains1(<span style="font-size: xx-small;">legen...</span>)</i> wait for it (<span style="font-size: xx-small;">...DARY!</span>). Because using GC as the safepoint operation is rather heavy handed (even though it's a small heap and there's no garbage) we'll switch to using another <b>safepoint operation</b> for this case: <i>ManagementFactory.getThreadMXBean().findDeadlockedThreads()</i>. This is a lot more lightweight than GC but becomes more expensive as we pile on threads (not by that much, but still) so I did not use it for the prior example (though come to think of it the number of threads should also impact GC length a bit, oh well). Anyway, variety is the spice of life. Here are the results with both methods and <i>park()</i>, but no <b>safepoint operation </b>(benchmark class is <a href="https://github.com/nitsanw/safepoint-experiments/blob/master/src/main/java/safepoint/fair/SafepointUsingFindDeadlocks.java" target="_blank">here</a>):<br />
<div>
<span style="background-color: #f3f3f3; font-family: "courier new" , "courier" , monospace; font-size: small;">
<b><u>K |contain1(us)| containsK(us) | park(ms) |Stp.ms|TTSP.ms|</u></b><br />
1 | 32.3 ± 0.1 | 34.8 ± 1.7 | 100.0 ± 0.0 | 0.0 | 0.0 |<br />
10 | 32.4 ± 0.5 | 349.7 ± 18.0 | 100.0 ± 0.0 | 0.0 | 0.0 |<br />
100 | 38.4 ± 2.8 | 3452.7 ± 130.4 | 100.0 ± 0.0 | 0.2 | 0.1 |<br />
1000 | 39.8 ± 1.1 | 33419.9 ± 421.6 | 100.0 ± 0.0 | 0.0 | 0.0 |<br />
10000 | 41.7 ± 5.0 | 336112.5 ± 10306.5 | 100.5 ± 1.7 | 0.0 | 0.0 |</span></div>
<br />
Indeed the cost of <i>contains1toK()</i> grows linearly with K and the 2 methods don't seem to impact each other that significantly.<br />
Here's what happens as we throw in the safepoint ops:</div>
<div>
<span style="background-color: #f3f3f3; font-family: "courier new" , "courier" , monospace; font-size: small;">
<b><u>K |contains1(us)| containsK(us) | park(ms) | park+FDT(ms) |Stp.ms |TTSP.ms|</u></b><br />
1 | 39.7 ± 0.9 | 39.6 ± 1.4 | 100.0 ± 0.0 | 100.3 ± 0.0 | 0.1 | 0.1 |<br />
10 | 40.0 ± 1.0 | 400.1 ± 14.4 | 100.0 ± 0.0 | 100.5 ± 0.0 | 0.2 | 0.2 |<br />
100 | 41.8 ± 1.1 | 3835.4 ± 141.2 | 100.1 ± 0.0 | 102.7 ± 0.1 | 2.1 | 2.0 |<br />
1000 | 67.4 ± 3.5 | 37535.7 ± 2237.2 | 102.5 ± 1.9 | 127.1 ± 1.1 | 20.7 | 20.7 |<br />
10000 | 183.4 ± 8.5 | 346048.6 ± 9818.2 | 194.9 ± 11.5 | 348.4 ± 11.4 | 152.8 | 152.8 |</span></div>
<br />
A terrible thing, right? The annoying/unfair thing is that being a shit has no effect on <i>contains1toK()</i> performance. It scales linearly and seems oblivious to the havoc it's wreaking for everyone else. Note that the problem we see here was present in the run where no safepoint operations were invoked; there were just no safepoint operations to bring it out. On a long enough timeline a safepoint operation will come along and we'd have seen the same issue around it.<br />
How can we fix it, and at what cost? Lets explore some options.<br />
<h4>
Solution 1: I use Zing -> use -XX:+KeepSafepointsInCountedLoops</h4>
<div>
So Zing is super awesome, and has this option to keep safepoint polls in counted loops which can come in handy in cases where one particular method is killing you softly with TTSP. Here's the results we get with Zing when running the benchmark above (same jvmArgs, same machine etc):</div>
<div>
<span style="background-color: #f3f3f3; font-family: "courier new" , "courier" , monospace; font-size: small;">
<b><u>K |contains1(us) | containsK(us) | park(ms) | park+FDT(ms) | Stp.ms|</u></b><br />
1 | 31.9 ± 0.9 | 30.5 ± 1.0 | 100.0 ± 0.0 | 100.2 ± 0.0 | 0.0 |<br />
10 | 32.9 ± 0.2 | 317.8 ± 8.8 | 100.0 ± 0.0 | 100.2 ± 0.0 | 0.0 |<br />
100 | 34.1 ± 2.4 | 3337.8 ± 265.4 | 100.1 ± 0.0 | 100.8 ± 0.2 | 0.4 |<br />
1000 | 44.4 ± 1.9 | 33991.6 ± 1199.0 | 100.9 ± 0.2 | 117.0 ± 0.6 | 10.2 |<br />
10000 | 159.0 ± 12.1 |299839.3 ± 6297.2 | 156.3 ± 8.0 | 319.8 ± 30.5 | 113.4 |</span></div>
<br />
Still bad, but Zing happens to squeeze better performance from the <i>containsNeedle()</i> method, and while <i>contains1()</i> performance degrades it opens up a wide gap over the behaviour observed above, probably because less time is lost to TTSP (<i>contains1toK()</i> finishes quicker). Also note that -XX:+PrintGCApplicationStoppedTime only prints out the stopped time for Zing. A separate command line argument is available which prints out safepoint statistics, but as this post grows longer I'm losing my will to rerun and parse stuff...<br />
We add the JVM argument: "-XX:CompileCommand=option,*.contains1ToK,+KeepSafepointsInCountedLoops" and watch the world burn:<br />
<div>
<span style="background-color: #f3f3f3; font-family: "courier new" , "courier" , monospace; font-size: small;">
<b><u>K |contain1(us)| containsK(us) | park(ms) | park+FDT(ms)| Stp.ms|</u></b>
<br />
1 | 33.0 ± 0.6 | 40.1 ± 1.2 | 100.0 ± 0.0 | 100.2 ± 0.0 | 0.0 |<br />
10 | 32.8 ± 0.7 | 405.1 ± 13.1 | 100.0 ± 0.0 | 100.2 ± 0.0 | 0.0 |<br />
100 | 33.3 ± 0.7 | 4124.8 ± 104.4 | 100.0 ± 0.0 | 100.3 ± 0.0 | 0.1 |<br />
1000 | 32.9 ± 0.2 | 40839.6 ± 854.0 | 100.0 ± 0.0 | 101.0 ± 0.3 | 0.6 |<br />
10000 | 40.2 ± 2.9 | 374935.6 ± 7145.3 | 111.3 ± 3.7 | 100.2 ± 0.0 | 0.0 |</span></div>
<br />
SHAZZZING!... This is pretty fucking awesome<i>. </i>The impact on <i>contains1toK() </i>is visible, but I didn't have to change the code and the TTSP issue is <a href="https://www.youtube.com/watch?v=CEEPaYD5KZE&feature=youtu.be&t=147" target="_blank">gone man, solid GONE</a>. A look at the assembly verifies that the inner loop is still unrolled within the outer loop, with the addition of the requested <b>safepoint polls</b>. There are more safepoint polls than I'd like here, since the option is applied to both the outer method and the inlined method (to be improved upon sometime down the road...).<br />
<br />
<h4>
Solution 2: I use OpenJDK/Oracle, disable inner loop inlining</h4>
If the inner loop is in its own method, as in our example, or if we can perhaps modify the code such that it is, we can ask the compiler to not inline that method. This should work out nicely here. To force the JVM to not inline the <i>containsNeedle()</i> method only in <i>contains1toK()</i> I made a copy of <i>containsNeedle()</i> called <i>containsNeedle2()</i> and used this argument: "-XX:CompileCommand=dontinline,*.containsNeedle2". The results:<br />
<div>
<span style="background-color: #f3f3f3; font-family: "courier new" , "courier" , monospace; font-size: small;">
<b><u>K |contain1(us)| containsK(us) | park(ms) | park+FDT(ms)|Stp.ms|TTSP.ms|</u></b><br />
1 | 39.9 ± 1.0 | 40.0 ± 2.6 | 100.0 ± 0.0 | 100.3 ± 0.0 | 0.1 | 0.0 |<br />
10 | 39.9 ± .7 | 395.6 ± 10.0 | 100.0 ± 0.0 | 100.3 ± 0.0 | 0.1 | 0.0 |<br />
100 | 41.3 ± 3.9 | 4030.6 ± 405.0 | 100.0 ± 0.0 | 100.3 ± 0.0 | 0.1 | 0.0 |<br />
1000 | 39.6 ± 2.1 | 34543.8 ± 1409.9 | 100.0 ± 0.0 | 100.3 ± 0.0 | 0.1 | 0.0 |<br />
10000 | 39.1 ± 1.2 | 328380.6 ± 8478.1 | 100.0 ± 0.0 | 100.3 ± 0.0 | 0.1 | 0.0 |</span></div>
<br />
<div>
That worked well! No harm to <i>contains1()</i>, no TTSP issue, and <i>contains1toK() </i>performance didn't suffer for it. Inlining is usually a key optimization you want to make sure happens, but in this case the right thing to do is stop it from happening. The work inside <i>containsNeedle()</i> is chunky enough to make this a good tradeoff. If the external loop was at a very high count, and the array size rather small this would have been a very different story.</div>
<h4>
Solution 3: Last resort. stop compiling the offending method</h4>
<div>
This is a bit terrible, but if things get bad AND you can't change the code, you can use a CompileCommand to exclude the method from compilation. The method will be forced to stay in the interpreter, so it will effectively have a <b>safepoint poll</b> between every 2 bytecodes. Performance will suck, but at least the application will be responsive. I am not gonna run this option, but if you want to, add the following argument to the command line and watch it roll: "-XX:CompileCommand=exclude,*.contains1toK"</div>
<div>
<br />
<b>Summary: </b>Threads which suffer from large TTSP will cause other threads to wait for them in the presence of a global safepoint attempt.</div>
<h3>
</h3>
<h3>
Example 3: Safepoint Operation Cost Scale</h3>
<div>
<script src="https://gist.github.com/nitsanw/3b39a4abc44c7475c90c.js"></script><br />
For this example we replace the <i>safepoint()</i> method with gathering thread stacks using <i>ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)</i>. Gathering thread stacks also occurs at a safepoint and the safepoint operation cost will scale with the number of threads. We construct a number of idle threads in the setup method to control the <b>safepoint operation</b> cost. To make sampling the threads' stacks expensive we park the fuckers 256 frames deep.<br />
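The idle threads are constructed along these lines (a sketch, assuming the setup described above):<br />
<pre>
import java.util.concurrent.locks.LockSupport;

// Sketch: an idle thread parked ~256 frames deep, so every stack sample of it is expensive
static Thread deepIdleThread(int depth) {
    Thread t = new Thread(() -> parkDeep(depth));
    t.setDaemon(true);
    t.start();
    return t;
}

static void parkDeep(int depth) {
    if (depth > 0) {
        parkDeep(depth - 1);
    } else {
        while (true) {
            LockSupport.park(); // sit here, deep down the stack
        }
    }
}
</pre>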
Different <b>safepoint operations</b> scale with different aspects of your process. Certain GC algorithms' stop-the-world phases will scale with heap size. Profiling by collecting thread dumps (as done by the JVisualVM sampler for instance) will scale with the number of threads (idle or not) and the stack depth. Deflating monitors will scale with the number of monitors in your application etc.<br />
Here's what happens as we pile on the idle threads (running only <i>contains1()</i>):</div>
<div>
<span style="background-color: #f3f3f3; font-family: "courier new" , "courier" , monospace; font-size: small;"><u><b>IT | contain1(us)| park(ms) | park+DTS(ms) | Stp.ms|TTSP.ms|</b></u>
<br />
0 | 38.0 ± 1.6 | 100.0 ± 0.0 | 100.5 ± 0.0 | 0.2 | 0.0 |<br />
1 | 40.7 ± 1.4 | 100.1 ± 0.0 | 102.2 ± 0.0 | 0.6 | 0.0 |<br />
10 | 44.7 ± 0.8 | 100.2 ± 0.1 | 117.2 ± 0.1 | 5.5 | 0.0 |<br />
100 | 75.5 ± 0.8 | 122.6 ± 3.5 | 261.9 ± 2.3 | 61.3 | 0.1 |<br />
1000 | 204.5 ± 38.5 | 257.5 ± 28.8 | 1414.2 ± 47.0 | 313.0 | 1.0 |<br />
</span></div>
<br />
We get long stops and short TTSPs! Hoorah! You can play with the number of threads and the depth of the stack to get the effect you want. It is interesting that most profilers using JVMTI::Get*StackTrace opt for sampling all threads. This is risky to my mind, as we can see above that the profiling overhead is open ended. It would seem reasonable to sample 1 to X threads (randomly picking the threads we sample) to keep the overhead in check on the one hand and pick a reasonable balance of TTSP vs operation cost on the other.<br />
As mentioned previously, JVMTI::GetStackTrace does not cause a global safepoint on Zing. Sadly AFAIK no commercial profilers use this method, but if they did it would help. The JMX thread bean does expose a single thread profiling method, but this only delegates to the multi-thread stack trace method underneath, which does cause a global safepoint...<br />
<br />
<b>Summary: </b>This final step adds another interesting control for us to play with completing our set:<br />
<ul>
<li><b>Safepoint poll</b> interval -> Will drive time to safepoint. CPU scarcity, swapping and page faults are other common reasons for large TTSP.</li>
<li><b>Safepoint operation</b> interval -> Sometimes in your control (be careful how you profile/monitor your application), sometimes not.</li>
<li><b>Safepoint operation cost</b> -> Will drive time in safepoint. Will depend on the operation and your application specifics. Study your GC logs.</li>
</ul>
<br />
<ul>
</ul>
<h3>
Example 4: Adding Safepoint Polls Stops Optimization</h3>
<div>
This is a repeat of a benchmark I ran to demonstrate the side effect of safepoint polls in loops on loop optimizations in <a href="https://github.com/netty/netty/pull/3969#issuecomment-132559757" target="_blank">the context of changing one of the Netty APIs</a>. This post is so long by now that if you had the strength to read this far you can probably click through to the ticket. I'll quote the basic results here for completeness.<br />
The benchmark:<br />
<script src="https://gist.github.com/nitsanw/49ce14d65b4759a1af1d.js"></script>
We are debating replacing the length and DataSet index types from int to longs. The impact on usage inside loops is substantial as demonstrated by the following results:<br />
<div>
<span style="background-color: #f3f3f3; font-family: "courier new" , "courier" , monospace; font-size: small;"><u><b>Benchmark (size) Mode Cnt Score Error Units</b></u><br />
copyInt 1000 avgt 5 76.943 ± 14.256 ns/op<br />
copyLong 1000 avgt 5 796.583 ± 55.583 ns/op<br />
equalsInt 1000 avgt 5 367.192 ± 16.398 ns/op<br />
equalsLong 1000 avgt 5 806.421 ± 196.106 ns/op<br />
fillInt 1000 avgt 5 84.075 ± 19.033 ns/op<br />
fillLong 1000 avgt 5 567.866 ± 10.154 ns/op<br />
sumInt 1000 avgt 5 338.204 ± 44.529 ns/op<br />
sumLong 1000 avgt 5 585.657 ± 105.808 ns/op</span></div>
There are several optimizations negated by the transition from counted to un-counted loop (uncounted loops currently include loops limited by a long, and therefore include a safepoint):<br />
<ul>
<li>Loop unrolling : This is not really a safepoint issue, can be fixed as per the KeepSafepointsInCountedLoops option</li>
<li>Fill pattern replacement (-XX:+OptimizeFill): the current fill implementation has no safepoints and the loop pattern it recognizes does not allow it.</li>
<li>Superword optimizations (-XX:+UseSuperword): current Superword implementation doesn't allow for long loops. This will probably require the compiler to construct an outer loop with an inner superword loop </li>
</ul>
</div>
The latter 2 have been <a href="http://psy-lob-saw.blogspot.co.za/2015/04/on-arraysfill-intrinsics-superword-and.html" target="_blank">discussed on this blog before in some detail</a>.<br />
<br />
<h3>
Final Summary And Testament</h3>
<div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkvriHNHxT7Q_9tJ2rzTunNLxjR9rQRuDy596Af1U52fKCzI-9JUGGC5_rFKCUHg4Xn35tKOTipGiuJFwkRhW76T_Zq1NHwH4fdbMP6lkmWQxuATfeNP2Of3i21Q5_2pBfqfFzLgFl1EY/s1600/fmercury-thumbsup.jpeg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkvriHNHxT7Q_9tJ2rzTunNLxjR9rQRuDy596Af1U52fKCzI-9JUGGC5_rFKCUHg4Xn35tKOTipGiuJFwkRhW76T_Zq1NHwH4fdbMP6lkmWQxuATfeNP2Of3i21Q5_2pBfqfFzLgFl1EY/s1600/fmercury-thumbsup.jpeg" /></a>If you've read this far, you deserve a summary, or a beer, or something. WELL DONE YOU!!! Here's what I consider the main takeaways from this discussion/topic:</div>
<div>
<ul>
<li>Safepoint/Safepoint poll/Safepoint operation are 3 distinct terms, often mingled. Separating them allows us to consider their costs and impacts individually.</li>
<li>In most discussions the term Safepoint is used to describe a global safepoint. Some JVMs can control safepoint transition on the thread level. By considering the individual thread state we can more easily reason about the transition costs on a thread by thread level. </li>
<li>The failure of any thread to reach a safepoint (by hitting a safepoint poll) can cripple the whole JVM. In this sense almost any bit of Java code can be considered blocking (as in not lock free). Consider for example a CAS loop: it's not counted, so it will have a safepoint poll in it, and the thread executing it can be blocked at that poll indefinitely if another Java thread is suspended while not at a safepoint.</li>
<li>JVMs have many safepoint operations. If you are worried about their impact you should enable -XX:+PrintGCApplicationStoppedTime. Further detail on what types of safepoint operations were involved can be made available using -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=X. Try it out, it's educational.</li>
</ul>
</div>
<div>
<h3>
References</h3>
</div>
<div>
<ul>
<li>From the <a href="http://openjdk.java.net/groups/hotspot/docs/RuntimeOverview.html#Thread%20Management|outline" target="_blank">OpenJDK Runtime Overview</a> look for the "VM Operations and Safepoints" section. Note that the term safepoint is used here to mean a global safepoint.</li>
<li><a href="http://jpbempel.blogspot.co.za/2013/03/safety-first-safepoints.html" target="_blank">J.Bempel's post</a> covering many safepoint operations and observing the impact of safepoints.</li>
<li><a href="http://blog.ragozin.info/2012/10/safepoints-in-hotspot-jvm.html" target="_blank">Alexey Ragozin's post</a> also covers similar definition to the Runtime Overview and explores the safepoint poll mechanism and logging options.</li>
<li><a href="https://groups.google.com/forum/#!msg/mechanical-sympathy/GGByLdAzlPw/cF1_XW1AbpEJ" target="_blank">Gil Tene post on the Mechanical Sympathy mailing list</a> which closely matches the "Thread State" approach above. Gil goes into the topic in the context of Unsafe operations and assumptions surrounding safepoints. </li>
</ul>
</div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com14tag:blogger.com,1999:blog-5171098727364395242.post-10020461543282148912015-10-19T09:46:00.002+01:002015-10-19T09:46:34.495+01:00Expanding The Queue interface: Relaxed Queue Access<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaS3SPxEOO41h872W4NvLghpRO4Cwj0bSOtMreF9ucWzckPdYL_04QM8bEpdfVlcat4lodVHePWsukwfCFZkqKrhWjQ0McqUb0OzQZZ5msGTvOoOHKc4hOuqc8Ap-3Fb_UBUiVDxmv_UE/s1600/Relax_single.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaS3SPxEOO41h872W4NvLghpRO4Cwj0bSOtMreF9ucWzckPdYL_04QM8bEpdfVlcat4lodVHePWsukwfCFZkqKrhWjQ0McqUb0OzQZZ5msGTvOoOHKc4hOuqc8Ap-3Fb_UBUiVDxmv_UE/s200/Relax_single.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><br /></td></tr>
</tbody></table>
Continuing from the <a href="http://psy-lob-saw.blogspot.co.za/2015/08/an-extended-queue-interface.html" target="_blank">previous post</a> on expanding the Queue interface to support new ways of interacting with queues, I have gone ahead and implemented relaxedOffer/Poll/Peek for the JCTools queues. This was pretty easy: the original algorithms all required a bit of mangling to support the strong semantics, so relaxing them made life easier. Implementing batch/continuous interactions was a bit more involved, but the results were interestingly rewarding. I will save discussing the drain/fill results for a follow-up post.<br />
The new interface is <a href="https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/MessagePassingQueue.java" target="_blank">here</a>, feedback is welcome.<br />
TL;DR:<br />
<br />
<ul>
<li>Throughput is improved in all cases (~5-15% increase)</li>
<li>Cost of failure (when offer/poll fail to return/add a value) is dramatically reduced</li>
<li>Latency for bursts is improved (as much as 40% reduction)</li>
</ul>
<br />
<h3>
Relaxed Access API</h3>
<div>
This ended up a very small API change, most of the difficulty here being choice of name. I went with the 'relaxed' prefix as suggested by J.P.Bempel (as opposed to the 'weak' prefix considered earlier):</div>
<div>
<script src="https://gist.github.com/nitsanw/d2efff5b209a6429965f.js"></script></div>
<div>
<br /></div>
<h3>
Relaxed vs. Non-Relaxed Example</h3>
<div>
<div>
The 'relaxation' of the queue full/empty reporting in these methods allows us to remove an extra load of the 'opposite' index and avoid waiting around for delayed visibility. Here's how <a href="https://github.com/JCTools/JCTools/blob/edaa8503b040b48e0f5d92eb4129813993e3ca13/jctools-core/src/main/java/org/jctools/queues/MpscArrayQueue.java#L234" target="_blank">MPSCArrayQueue::poll</a> was changed to become 'relaxed':<br />
<script src="https://gist.github.com/nitsanw/bb60c6701a989c7b6173.js"></script><br />
So now we can drop lines 11-15, it doesn't look like much. We wouldn't really expect this gap in offer (see <a href="https://github.com/JCTools/JCTools/blob/edaa8503b040b48e0f5d92eb4129813993e3ca13/jctools-core/src/main/java/org/jctools/queues/MpscArrayQueue.java#L153" target="_blank">offer</a> for details) to happen very often, so the overhead can typically amount to an extra load of the producerIndex and a predictable branch.<br />
Also, since poll on an MPSC is actually cheaper than the offer anyhow, aren't we optimizing a non-issue? Could we not be actually making throughput worse by making the consumer more aggressive?</div>
</div>
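<div>
To make the difference concrete without clicking through, here is a simplified, self-contained sketch of the two poll flavours. This is illustrative code using atomic wrappers, not the actual JCTools implementation (class and field names are made up):</div>
<pre>
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Illustrative only: a stripped down single-consumer view of an MPSC array
// queue, using atomic wrappers instead of the field offset tricks JCTools
// actually uses.
final class SimplifiedMpscPoll<E> {
    final AtomicReferenceArray<E> buffer = new AtomicReferenceArray<>(1024);
    final AtomicLong producerIndex = new AtomicLong();
    final AtomicLong consumerIndex = new AtomicLong();
    final int mask = 1023;

    // Strict poll: a null slot is ambiguous (empty? or an offer in flight?),
    // so we must also load the producer index - a likely cache miss - and
    // spin if an offer is known to be in progress.
    E poll() {
        final long cIndex = consumerIndex.get();
        final int offset = (int) cIndex & mask;
        E e = buffer.get(offset);
        if (e == null) {
            if (cIndex == producerIndex.get()) {
                return null; // really empty
            }
            do { // an offer is in flight, wait for the element to appear
                e = buffer.get(offset);
            } while (e == null);
        }
        buffer.lazySet(offset, null);        // clear the slot
        consumerIndex.lazySet(cIndex + 1);   // publish consumer progress
        return e;
    }

    // Relaxed poll: a null slot simply means "nothing for you right now",
    // which may be a spurious empty report while an offer is mid-flight.
    E relaxedPoll() {
        final long cIndex = consumerIndex.get();
        final int offset = (int) cIndex & mask;
        final E e = buffer.get(offset);
        if (e == null) {
            return null;
        }
        buffer.lazySet(offset, null);
        consumerIndex.lazySet(cIndex + 1);
        return e;
    }
}
</pre>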
<div>
<br /></div>
<h3>
Throughput Impact Of Relaxation</h3>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpSdg1GVVlHSTytpF_qEeqFDhtsQgs8dtNmVyLO6ViBIj1ai_N1_4_3uWT7P1ZuLDl7Z6zuowvBfzVydkGUYkKU-eBogtf4MjrvXb7CtEEflqkJSPGllIEGqnL_kiRTHXp4oxb2hULf7w/s1600/frabz-How-do-I-relax-Chamomile-motherfers-Thats-how-I-fing-relax-21e4cb.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="175" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpSdg1GVVlHSTytpF_qEeqFDhtsQgs8dtNmVyLO6ViBIj1ai_N1_4_3uWT7P1ZuLDl7Z6zuowvBfzVydkGUYkKU-eBogtf4MjrvXb7CtEEflqkJSPGllIEGqnL_kiRTHXp4oxb2hULf7w/s200/frabz-How-do-I-relax-Chamomile-motherfers-Thats-how-I-fing-relax-21e4cb.jpg" width="200" /></a></div>
To test the impact of this change I constructed a <a href="http://psy-lob-saw.blogspot.co.za/p/jmh-related-posts.html" target="_blank">JMH</a> benchmark that measures throughput when both consumer and producer spin to offer/poll as fast as they can (see the code <a href="https://github.com/JCTools/JCTools/blob/master/jctools-benchmarks/src/main/java/org/jctools/jmh/throughput/MpqThroughputBackoffNone.java" target="_blank">here</a>). I ran the benchmark on a Haswell machine (i7-4770 CPU@3.40GHz/Ubuntu/OracleJVM8u60). The benchmarks were pinned to run cross core (no 2 threads share a physical core, i.e. no hyperthreading). Benchmarks were run with 5 forks, 10 warmup iterations (1 second), 10 measured iterations (1 second). The score reports operations per microsecond, or millions of operations per second if you prefer. The throughput is the number of pollsMade, but the other figures are interesting as well.</div>
<div>
These results are with a single producer and consumer:</div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> <b>Type | 1P1C | Normal± Err | Relax ± Err |</b></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> Spsc | offersFailed | 0.0 ± 0.1 | 0.1 ± 0.1 |</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> Spsc | offersMade | 439.1 ± 4.1 | 432.1 ± 10.0 |</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> Spsc | pollsFailed | 0.2 ± 0.5 | 0.1 ± 0.2 |</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> Spsc | pollsMade | 438.9 ± 4.1 | 431.9 ± 10.0 |</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> Mpsc | offersFailed | 0.0 ± 0.0 | 0.0 ± 0.0 |</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> Mpsc | offersMade | 22.7 ± 0.6 | 31.9 ± 0.8 |</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> Mpsc | pollsFailed | 0.1 ± 0.1 | 6.6 ± 0.9 |</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> Mpsc | pollsMade | 22.8 ± 0.6 | 31.9 ± 0.8 |</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> Spmc | offersFailed | 0.2 ± 0.2 | 81.0 ± 5.4 |</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> Spmc | offersMade | 23.5 ± 0.6 | 26.8 ± 0.6 |</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> Spmc | pollsFailed | 0.1 ± 0.1 | 0.2 ± 0.3 |</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> Spmc | pollsMade | 23.2 ± 0.6 | 26.6 ± 0.6 |</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> Mpmc | offersFailed | 0.0 ± 0.0 | 0.0 ± 0.1 |</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> Mpmc | offersMade | 71.4 ± 5.1 | 71.7 ± 4.8 |</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> Mpmc | pollsFailed | 0.0 ± 0.1 | 0.3 ± 0.9 |</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> Mpmc | pollsMade | 71.4 ± 5.1 | 71.7 ± 4.8 |</span></div>
</div>
<br />
<ul>
<li>The SPSC case serves as a baseline, and I expected results to be the same. There's a 2% difference which is close to the error reported. Maybe the inlining depth impacted the performance slightly, or perhaps the benchmark is not as stable as I'd like.</li>
<li>The MPSC case shows a 50% improvement. This is pretty awesome. We also notice the consumer now fails more often, which is indicative of the reduced cost of failing to find elements in the queue. The cache miss on the producerIndex was slowing down both the consumer and the producer. This phenomenon has been observed in other posts, but my best attempt at an explanation is <a href="http://psy-lob-saw.blogspot.co.za/2013/09/diving-deeper-into-cache-coherency.html" target="_blank">here</a>. The bottom line being that the mutator suffers from losing the exclusive state of the cache line in its own cache, while the reader gets hit with a read miss (going to LLC as the threads are on separate cores).</li>
<li>The SPMC case shows 15% improvement on throughput. The cost of failing to offer has gone down dramatically.</li>
<li>The MPMC case shows no change, but does show great throughput. This could be down to the very different algorithm in play or just because the MPMC queue is well balanced and thus behaves very well in a balanced use case as above. This may lead you to believe MPMC is a better choice than MPSC, but as the latency results and contended throughput results show this is an anomaly of the benchmark load. Because MPSC presents a much cheaper poll than offer the queue is always empty, making the contention on the queue buffer worse than it would be for MPMC where offer and poll are of similar costs.</li>
</ul>
Using the command line controls over thread group sizes (-tg) I ran a few more cases. Did I mention I love JMH?<br />
Here's MPMC and MPSC with 3 producers and 1 consumer:<br />
<div>
<span style="font-family: Courier New, Courier, monospace;"> <b>Type | 3P1C | Normal± Err | Relax ± Err |</b></span><br />
<span style="font-family: Courier New, Courier, monospace;"> Mpsc | offersFailed | 0.0 ± 0.0 | 0.1 ± 0.6 | </span><br />
<span style="font-family: Courier New, Courier, monospace;"> Mpsc | offersMade | 12.7 ± 0.1 | 13.4 ± 0.1 | </span><br />
<span style="font-family: Courier New, Courier, monospace;"> Mpsc | pollsFailed | 4.1 ± 0.5 | 115.3 ± 4.9 | </span><br />
<span style="font-family: Courier New, Courier, monospace;"> Mpsc | pollsMade | 12.7 ± 0.1 | 13.4 ± 0.1 | </span><br />
<span style="font-family: Courier New, Courier, monospace;"> Mpmc | offersFailed | 0.0 ± 0.0 | 0.0 ± 0.0 | </span><br />
<span style="font-family: Courier New, Courier, monospace;"> Mpmc | offersMade | 11.1 ± 0.0 | 11.7 ± 0.1 | </span><br />
<span style="font-family: Courier New, Courier, monospace;"> Mpmc | pollsFailed | 0.3 ± 0.1 | 60.4 ± 10.0 | </span><br />
<span style="font-family: Courier New, Courier, monospace;"> Mpmc | pollsMade | 11.1 ± 0.0 | 11.7 ± 0.1 |</span></div>
<ul>
<li>MPSC case shows slight improvement to throughput of roughly 5%, but the improvement to cost of failure is very visible here.</li>
<li>MPMC case improves by 5% and the improved cost of failure is similarly visible.</li>
</ul>
<div style="margin: 0px;">
Here's MPMC and SPMC with 1 producer and 3 consumers:</div>
<span style="font-family: Courier New, Courier, monospace;"> <b>Type | 1P3C | Normal± Err | Relax ± Err |</b></span><br />
<span style="font-family: Courier New, Courier, monospace;"> Spmc | offersFailed | 1.5 ± 0.2 | 125.9 ± 4.6 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Spmc | offersMade | 13.4 ± 0.1 | 14.0 ± 0.0 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Spmc | pollsFailed | 0.0 ± 0.0 | 0.0 ± 0.0 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Spmc | pollsMade | 13.1 ± 0.1 | 13.7 ± 0.0 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Mpmc | offersFailed | 23.9 ± 0.8 | 124.1 ± 3.2 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Mpmc | offersMade | 11.0 ± 0.0 | 12.1 ± 0.0 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Mpmc | pollsFailed | 0.0 ± 0.2 | 1.0 ± 1.6 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Mpmc | pollsMade | 10.7 ± 0.0 | 11.9 ± 0.0 |</span><br />
<ul>
<li>SPMC case shows 5% improvement and great improvement to failure performance.</li>
<li>MPMC case improves by 11% and the improved cost of failure is similarly visible.</li>
</ul>
Here's MPMC only, with 2 producers and 2 consumers:<br />
<span style="font-family: Courier New, Courier, monospace;"> <b>Type | 2P2C | Normal± Err | Relax ± Err |</b></span><br />
<span style="font-family: Courier New, Courier, monospace;"> Mpmc | offersFailed | 1.4 ± 0.8 | 0.7 ± 0.6 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Mpmc | offersMade | 14.8 ± 0.8 | 16.0 ± 0.6 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Mpmc | pollsFailed | 0.0 ± 0.0 | 0.4 ± 0.5 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Mpmc | pollsMade | 14.7 ± 0.8 | 15.9 ± 0.6 |</span><br />
<ul>
<li>The improvement to throughput is a nice 8%, but the error is not that far from it so I'm not too confident without lots more runs. This is a similarly symmetric load to the 1P1C and the balance is showing through.</li>
</ul>
I could go on to higher core counts etc, but I'm actually feeling the win here is pretty conclusive:<br />
<div>
<ul>
<li>Throughput is generally improved</li>
<li>Failure to make progress is significantly cheaper</li>
</ul>
Why is this failure mode important? It matters when you have a consumer/producer which is able to make progress in the face of such a failure. A common example of this pattern is a Selector thread which also has an incoming queue of commands to execute: when there's nothing in the queue we would like to go back to selecting on the selector. A sketch of such a loop follows.</div>
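<div>
The sketch below is illustrative only: the class and the I/O handling are made up, relaxedPoll is the real new method.</div>
<pre>
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

import org.jctools.queues.MessagePassingQueue;

// Illustrative only: a selector thread that drains its command queue with
// relaxedPoll. A (possibly spurious) 'empty' result simply sends it back to
// select(), so the cheap failure path is exactly what we want here.
final class SelectorLoop {
    static void run(Selector selector, MessagePassingQueue<Runnable> commands) throws IOException {
        while (!Thread.currentThread().isInterrupted()) {
            Runnable task;
            while ((task = commands.relaxedPoll()) != null) {
                task.run();
            }
            selector.select(1); // nothing (visibly) queued, go back to selecting
            for (SelectionKey key : selector.selectedKeys()) {
                handle(key);
            }
            selector.selectedKeys().clear();
        }
    }

    private static void handle(SelectionKey key) {
        // application specific I/O handling
    }
}
</pre>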
<div>
<br /></div>
<div>
<br />
<h3>
Latency impact of Relaxation</h3>
</div>
<div>
Is this going to help us in the case of burst handling latency? There's a benchmark for that too! Or rather there's an estimate of burst processing cost. Code is <a href="https://github.com/JCTools/JCTools/blob/master/jctools-benchmarks/src/main/java/org/jctools/jmh/latency/spsc/MpqRelaxedBurstCost.java" target="_blank">here</a>. The general idea being to send a burst of messages from the producer, then wait for completion notice from the consumer.<br />
Here goes:<br />
<span style="font-family: Courier New, Courier, monospace;"> Size | Type | Normal±Err | Relax± Err |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 1 | Spsc | 126 ± 2 | 126 ± 1 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 1 | Mpsc | 144 ± 2 | 140 ± 1 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 1 | Spmc | 187 ± 2 | 189 ± 2 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 1 | Mpmc | 190 ± 8 | 203 ± 2 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 10 | Spsc | 193 ± 2 | 192 ± 3 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 10 | Mpsc | 611 ± 22 | 429 ± 7 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 10 | Spmc | 309 ± 3 | 307 ± 1 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 10 | Mpmc | 807 ± 19 | 677 ± 15 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 100 | Spsc | 530 ± 7 | 528 ± 4 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 100 | Mpsc | 5243 ± 158 | 3023 ± 107 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 100 | Spmc | 1240 ± 7 | 1250 ± 6 |</span><br />
<span style="font-family: Courier New, Courier, monospace;"> 100 | Mpmc | 4684 ± 382 | 3665 ± 111 |</span><br />
<br />
<ul>
<li><span style="font-family: inherit;">Same as before, SPSC serves as a sanity check/baseline. No change is good.</span></li>
<li><span style="font-family: inherit;">SPMC/MPSC results are pretty similar when sending a single message. MPMC seems slightly regressed.</span></li>
<li><span style="font-family: inherit;">When sending 10/100 messages in a burst we see significant benefits to using the relaxed methods, in </span>particular<span style="font-family: inherit;"> for the MPSC and MPMC cases.</span></li>
</ul>
</div>
<div>
<h3>
<br /></h3>
<h3>
Summary</h3>
</div>
<div>
Relaxing is good for you, deep down you always knew it. Turns out it ain't half bad for your queues either :-). The new API will be part of the next version of JCTools, coming out as soon as I can work through my Maven mojo phobia.</div>
<div>
<br /></div>
<div>
Thanks <a href="http://jpbempel.blogspot.com/" target="_blank">J.P. Bempel</a> and <a href="https://twitter.com/philip_aston" target="_blank">Phil Aston</a> for their kind review, remaining errors are a result of me refusing to listen to them persnickety bastards. </div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com0tag:blogger.com,1999:blog-5171098727364395242.post-55558741463238864022015-08-19T16:39:00.000+01:002015-08-19T16:39:27.651+01:00An extended Queue interface<div class="separator" style="clear: both; text-align: left;">
In my work on <a href="https://github.com/JCTools/JCTools/issues" target="_blank">JCTools</a> I have implemented a fair number of concurrent access queues. The Queue interface is part of the java.util package and offers, on the one hand, a larger API surface area than I consider core to concurrent message passing, while on the other hand it is missing some methods I'd find useful. I'm hoping to solicit some discussion on some new methods, and see if I can be convinced to implement those I have decided to avoid thus far.</div>
<br />
<h3>
Inter-thread Message Passing vs. Generic Collection</h3>
<div>
The queue data structure traditionally has an offer/push/enqueue method for writing into the queue and a poll/pop/dequeue method for removing elements. This should be enough for anyone, right?</div>
<div>
Turns out some other useful methods come to mind when using a queue:</div>
<div>
<ul>
<li>peek: looking, not touching</li>
<li>size: for monitoring and heuristics and so on</li>
<li>isEmpty: a potentially more efficient assertion than size == 0</li>
</ul>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyNsQiJhbgQ373WF85LBUqLNH5mNzplAlR-H90_yrvc1XF832zsCmPhSaPclNuE2RVGFr9QtTXVQ7d1S0hNYFAjNWvdOTxM6Ejh4Rwfimc-FjESxKOl6x5XHzskHVfv9Wsu_97p3ikOOc/s1600/funny-weird-facts-8-queue.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="156" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyNsQiJhbgQ373WF85LBUqLNH5mNzplAlR-H90_yrvc1XF832zsCmPhSaPclNuE2RVGFr9QtTXVQ7d1S0hNYFAjNWvdOTxM6Ejh4Rwfimc-FjESxKOl6x5XHzskHVfv9Wsu_97p3ikOOc/s320/funny-weird-facts-8-queue.jpg" width="320" /></a>But I would argue that implementing the full scope of the java.util.Collection methods is slightly redundant. Why is it good to have offer/poll also exposed as add/remove/element?</div>
<div>
I have chosen to avoid implementing iterators for queues in JCTools, they seem to introduce a large complexity and I'm not convinced users out there need an iterator for their concurrent queue. The duplicate collection methods I get from extending AbstractQueue for free are alright though I don't see why anyone would actually use them.<br />
So there are lots of methods I don't want, but also some methods missing, particularly for concurrent queue semantics. I have come to the conclusion that the following are both practical to implement and helpful to users:<br />
<ul>
<li>weakOffer/Poll/Peek: the same as the 'strong' methods but relaxing the contract to allow a spurious negative return value. This is particularly useful for queues where we are able to detect the inability to claim the next slot independently of the queue being full or empty. The 'weak' prefix is different from the Atomic* classes' definition of 'weak' as it has no ordering implications, but I've been unable to come up with a better name (suggestions are most welcome). A possible shape for these methods is sketched after this list.</li>
<li>drain(receiver)/fill(provider): this is attractive for cases where a receiver/provider function exists such that calls to it are guaranteed to succeed (e.g. receiver could just throw elements away, producer might return a constant). There are some small savings to be made here around the function calling loop, but for some these small savings might prove significant. A perpetual version of these methods could also take some sort of backoff strategy and spin when hitting an empty/full condition (similar to the Disruptor's BatchProcessor workflow).</li>
<li>batchPoll/Offer: this is attractive for processing a known number of elements out of/into a queue. This is particularly profitable when multiple consumers/producers contend, as the granularity of contention is reduced. It is also potentially damaging as batches can introduce large 'bubbles' into the queue and increase the unfairness. These methods could support a receiver/provider interface or use an array (enables batch array copies, guarantees quick release of a 'bubble').</li>
<li>offeredElementsCount/polledElementsCount: this is purely for reporting and would tend to be a long reflecting the producer/consumer counter. For linked queues this adds a new overhead, so while this is a clearly useful feature I'm reluctant to make it a compulsory part of the new interface.</li>
</ul>
</div>
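<div>
To make the proposal a little more concrete, here is one possible shape for the weak and drain/fill methods. The names and signatures below are illustrative only, not a settled API:</div>
<pre>
import java.util.function.Consumer;
import java.util.function.Supplier;

// Purely illustrative: one possible shape for the proposed additions.
public interface MessagePassingQueueExtensions<E> {
    // 'weak' variants: may spuriously report full/empty.
    boolean weakOffer(E e);
    E weakPoll();
    E weakPeek();

    // Drain/fill with functions that are assumed to always succeed;
    // both return the number of elements moved.
    int drain(Consumer<E> receiver);
    int fill(Supplier<E> provider);

    // Batch variants with an explicit bound on the number of elements.
    int drain(Consumer<E> receiver, int limit);
    int fill(Supplier<E> provider, int limit);
}
</pre>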
<div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyNsQiJhbgQ373WF85LBUqLNH5mNzplAlR-H90_yrvc1XF832zsCmPhSaPclNuE2RVGFr9QtTXVQ7d1S0hNYFAjNWvdOTxM6Ejh4Rwfimc-FjESxKOl6x5XHzskHVfv9Wsu_97p3ikOOc/s1600/funny-weird-facts-8-queue.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><br /></a>I plan to implement the above for <a href="https://github.com/JCTools/JCTools" target="_blank">JCTools</a> in the near future. What do you think? Please opine in the comments or via the <a href="https://github.com/JCTools/JCTools/issues" target="_blank">projects suggestion box</a>.</div>
<div>
Thanks :)</div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com3tag:blogger.com,1999:blog-5171098727364395242.post-80604926366813973342015-07-27T08:26:00.000+01:002017-03-31T07:59:32.159+01:00JMH perfasm explained: Looking at False Sharing on Conditional Inlining<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEim9IOFHQWfzR5I4kV79NEy68vdJSA9YD8dDplVV93KJAzo_uAihL4iSAstITYX-VzVc7spphUfr1yqY9oH60_Xxn-87DTMcHmRTlmlSkoqdL8NI3QhOPaXNNde75eZphLyzaRqtuLcQcw/s1600/KATANA_FINAL_3K_02.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEim9IOFHQWfzR5I4kV79NEy68vdJSA9YD8dDplVV93KJAzo_uAihL4iSAstITYX-VzVc7spphUfr1yqY9oH60_Xxn-87DTMcHmRTlmlSkoqdL8NI3QhOPaXNNde75eZphLyzaRqtuLcQcw/s200/KATANA_FINAL_3K_02.jpg" width="200" /></a></div>
There is an edge that <a href="http://openjdk.java.net/projects/code-tools/jmh/" target="_blank">JMH</a> (read the<a href="http://psy-lob-saw.blogspot.com/p/jmh-related-posts.html" target="_blank"> jmh resources page</a> for other posts and related nuggets) has over other frameworks. That edge is so sharp you may well cut yourself using it, but given an infinite supply of bandages you should definitely use it :-) This edge is the ultimate profiler, the perfasm (pronounced PERF-AWESOME!, the exclamation mark is silent). I've been meaning to write about it for a while and as it just saved my ass recently...<br />
<br />
<h3>
SpscGrowableArrayQueue's False Sharing Issue</h3>
<a href="https://github.com/JCTools/JCTools" target="_blank">JCTools</a> includes a specialized SPSC (single-producer/consumer) bounded queue aimed at actor systems, the <a href="https://github.com/JCTools/JCTools/blob/42fa9fab84026952fc31b06b69ba156b92c2d11b/jctools-core/src/main/java/org/jctools/queues/SpscGrowableArrayQueue.java" target="_blank">SpscGrowableArrayQueue</a>. This queue is quite neat because it is combining the compactness of a linked queue with the awesome throughput of an array backed queue. The idea is quite simple, and similar in spirit to an ArrayList:<br />
<ol>
<li>Start with a small buffer, default to 16 elements</li>
<li>As long as the queue doesn't grow beyond that just stay small</li>
<li>If on offer you find that the queue is full double the size of the underlying buffer</li>
</ol>
The mechanics of the queue resizing, handling intermediate queue states, and detecting a new buffer from the consumer are a bit involved and will perhaps be expanded on some other day.<br />
Because the queue is geared at actor systems, per-queue footprint is important, so I reworked the memory layout of SpscArrayQueue and skimped on the padding where possible. If you have no idea what I mean by padding, and why it is done, you can read the following posts:<br />
<ol>
<li><a href="http://psy-lob-saw.blogspot.com/2013/09/diving-deeper-into-cache-coherency.html" target="_blank">False sharing and the MESI protocol details related to cached counters</a>: This explores the motivation of padding fields in the first place</li>
<li><a href="http://psy-lob-saw.blogspot.com/2013/05/know-thy-java-object-memory-layout.html" target="_blank">Discovering and controlling object layout</a>(Using an early version of <a href="http://openjdk.java.net/projects/code-tools/jol/" target="_blank">JOL</a>): This show cases the JOL tool and how it should be used to discover object layout</li>
<li><a href="http://psy-lob-saw.blogspot.com/2013/07/single-producer-single-consumer-queue.html" target="_blank">SPSC Revisited - part I: An empiricist tale</a>: This post discusses the inlining of counters into a queue use inheritance to control field order and introduce padding. In this post I show how to pad the counter from each other and also how to pre/post pad the class and buffer.</li>
<li><a href="http://psy-lob-saw.blogspot.com/2014/06/notes-on-false-sharing.html" target="_blank">A more high level summary of false sharing is given in this post</a></li>
</ol>
With an overhead of roughly 128b per padding, this was instrumental in reducing the high memory cost per queue. The hot counters were still padded from each other, but the class pre/post padding was removed, as was the padding of the elements array.<br />
So, I gave up on some of the padding, but reasoned that in most cases this should not make a difference because the hottest fields were still padded from each other.<br />
<br />
<h3>
Why so slow?</h3>
<div>
Now, I expect a queue that is very similar to <a href="https://github.com/JCTools/JCTools/blob/42fa9fab84026952fc31b06b69ba156b92c2d11b/jctools-core/src/main/java/org/jctools/queues/SpscArrayQueue.java" target="_blank">SpscArrayQueue</a>, but adds some features, to be slower. There's just very little you can do about this: doing more will cost you something. But given that resizing is a rather exceptional event for this queue I thought this would be a minor hit, maybe a 10-20% reduction in performance (for a certain definition of performance). <a href="https://github.com/JCTools/JCTools" target="_blank">JCTools</a> has some <a href="https://github.com/JCTools/JCTools/tree/master/jctools-benchmarks" target="_blank">benchmarks</a> included and I ran QueueThroughputBackoffNone, which has a producer thread and a consumer thread chewing on the queue as hard as they can.<br />
Since the numbers are not very important here, I'll stick to running benchmarks on my laptop (an MBP 11,1/ Ubuntu 15.04/<a href="http://www.azulsystems.com/products/zulu/downloads" target="_blank">Zulu</a> JDK8u45 - Zulu is a supported and tested OpenJDK build) in this post. Rest assured that I have confirmed on real hardware the same results. The results I quote below are the <i>pollsMade</i> figure which reflects the actual delivered throughput.</div>
<div>
To my puzzlement I found:</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">SpscArrayQueue 361.223 ± 7.156 ops/us</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">SpscGrowableArrayQueue 64.277 ± 31.803 ops/us</span></div>
<br />
Crap performance and terrible variance, bugger me sideways.<br />
Looking at the code, I thought there must be something I was doing to upset the mighty JIT spirits. Maybe my methods were too big? My branches too unpredictable? My variable names too offensive? So I tweaked the code this way and that, and looked at the inlining log (-XX:+PrintInlining) and the assembly (-XX:+PrintAssembly/-XX:CompileCommand=print,*). I got some minor improvements, but it mostly still sucked. What's wrong? A quick look at "-prof perfasm" and a fair amount of head scratching led to the answer. The code (<a href="https://github.com/JCTools/JCTools/blob/7f78a25f40740df12def944faa67b318e6f5e8f2/jctools-core/src/main/java/org/jctools/queues/SpscGrowableArrayQueue.java#L26" target="_blank">before</a> and <a href="https://github.com/JCTools/JCTools/blob/42fa9fab84026952fc31b06b69ba156b92c2d11b/jctools-core/src/main/java/org/jctools/queues/SpscGrowableArrayQueue.java#L29" target="_blank">after</a>) is on github; the main focus of this post is perfasm and its usage so I won't dive into it.<br />
<br />
<h3>
Before we start: How does perfasm work?</h3>
<div>
To use perfasm you'll need a Linux (or Windows) OS, running on real hardware, the relevant perf tool installed, and a <a href="http://psy-lob-saw.blogspot.com/2013/01/java-print-assembly.html" target="_blank">JVM set up to print assembly</a>.</div>
<div>
Perf is (amongst other things) an instruction level profiler, of the kind that traditionally doesn't work with Java (though <a href="http://thread.gmane.org/gmane.linux.kernel/1890431" target="_blank">things are slowly changing</a>). One of the features offered by perf is the "perf record/annotate" workflow. Perf interrupts your process repeatedly and records the current PC (program counter) at each sample. This sampling of the program counter is recorded over a period of time to be post-processed by the annotate feature, which correlates samples to methods, lines of code and assembly instructions. The challenge for perf when dealing with Java code is that the binary form of each method only exists for the duration of that process's lifetime. This means the PC is mostly referring to methods that are nowhere to be found when the annotation stage comes along.</div>
<div>
To summarize: perf record works, but perf annotate is broken for Java.</div>
<div>
To make perfasm work JMH captures the JVM compiler outputs by enabling the following flags:</div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> -XX:+UnlockDiagnosticVMOptions</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> -XX:+LogCompilation</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> -XX:LogFile=...</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> -XX:+PrintAssembly</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> -XX:+PrintInterpreter</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> -XX:+PrintNMethods</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> -XX:+PrintNativeNMethods</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> -XX:+PrintSignatureHandlers</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> -XX:+PrintAdapterHandlers</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> -XX:+PrintStubCode</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> -XX:+PrintCompilation</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> -XX:+PrintInlining</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> -XX:+TraceClassLoading</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> -XX:PrintAssemblyOptions=syntax</span></div>
</div>
The data collected here allows perfasm to do the annotation process by itself. I'll not bore you with the details of the output processing, but hats off to Mr. Shipilev, who ploughed through the gory mess and made this work. The end result is a detailed output offering you the assembly of the hottest regions in your program, along with a list of the top hottest methods including both native and Java parts of the stack.<br />
Because perf allows the recording of any number of events, the definition of hottest depends on the events you choose to profile. The default events are cycles and instructions, but you can specify any number of events (e.g. -Djmh.perfasm.events=cycles,cache-misses). The first event specified will be the 'hotness' qualifier.<br />
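By way of example, a run might look like this (the jar path and benchmark pattern are placeholders for your own):<br />
<pre>
java -Djmh.perfasm.events=cycles,instructions \
     -jar target/benchmarks.jar ".*QueueThroughputBackoffNone.*" -prof perfasm
</pre>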
<br />
<h3>
What do you get?</h3>
<div>
The perfasm output is split into 4 sections:</div>
<div>
<ol>
<li>Annotated assembly output for the top HOT regions (titled "Hottest code regions (>10.00% "cycles" events):")</li>
<li>A list of the hot regions in your benchmark (titled "[Hottest Regions]"). This is the list of compiled methods inclusive of native methods, which is why we have the next section.</li>
<li>A list of the hottest methods after inlining, so only java methods in this section (titled "[Hottest Methods (after inlining)]").</li>
<li>A distribution of the cycles between types of regions (titled "[Distribution by Area]"). This will inform you of how the split goes between compiled code, kernel, JVM etc.</li>
</ol>
<div>
The most interesting of the 4 sections (and the only one I'm going to explain here) is the annotated hot regions section. I tend to edit the output of assembly spouting tools to make it more readable, but for the sake of explaining what's what here's the original output with footnotes:</div>
<div>
<script src="https://gist.github.com/nitsanw/97bf0b5a15fa1cd89727.js"></script></div>
<div>
To focus on the main features I trimmed out a lot of the assembly code, which is where you see the "<MORE OF SAME/>" comments. Note the legend at the start of the output describing the 2 leftmost columns: cycles and instructions sample percentages.</div>
<div>
Here's the footnotes explained:</div>
<div>
<ol>
<li>(Line 7) This column is the instruction address. This is important as this column is both how perfasm matches the perf record output with the PrintAssembly output and how poor sods like yours truly have to look for jump destinations in the code.</li>
<li>(Line 7) This column has the actual instructions, in AT&T syntax (src -> dst).</li>
<li>(Line 7) This column is the instruction annotation generated by PrintAssembly. It is not always correct, but it's helpful in connecting the dots. Multi-level inlining of methods can be seen in action everywhere.</li>
<li>(Line 12) The condition/loop edge annotation is a wonderful recent addition to perfasm. It makes life marginally more bearable for the assembly consumer by connecting the jumps with their destinations. This is the start point annotation "<span style="background-color: white; color: #333333; font-family: "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; line-height: 16.7999992370605px; white-space: pre;">╭</span>"</li>
<li>(Line 16) This is the end point annotation "<span style="background-color: white; color: #333333; font-family: "consolas" , "liberation mono" , "menlo" , "courier" , monospace; font-size: 12px; line-height: 16.7999992370605px; white-space: pre;">↗</span>"</li>
</ol>
<div>
When looking at this you will be searching for suspect hot spots in the code. Some instruction which is chewing away your cycles. If you're in luck you'll be able to tweak your code just right to side step it and then you are off and away :-)</div>
</div>
</div>
<div>
<br /></div>
<h3>
Back on Track: Show me the cycles!</h3>
<div>
Right, so I point Captain Awesome at my benchmark and what do I see (trimmed addresses and edges as there are no important conditions here, also trimmed package names and moved the code annotation left):</div>
<div>
<script src="https://gist.github.com/nitsanw/12cf2e56fae1fbda1e8b.js"></script></div>
<div>
This instruction, eating 30% of my cycles (line 7, needs <a href="https://www.youtube.com/watch?v=rjlSiASsUIs" target="_blank">HEALING!</a>), is part of a guard generated to ensure that inlining the poll method call is still a valid decision, i.e. that the instance observed at the callsite is of the class for which we inlined the poll method. This is not really a Java code line; it is the aftermath of inlining a virtual call (through an interface, see <a href="http://shipilev.net/blog/2015/black-magic-method-dispatch/" target="_blank">Aleksey's method dispatch inlining article</a> to learn more on inlining).<br />
To be fair, the above is kind of hard to understand. Why is a comparison between a constant and a register eating 30% of the cycles? According to every instruction 'cost' manual this should take 1 cycle (and the CPU can do 3-4 of them in parallel too), so this is obviously not the instruction I'm looking for.<br />
This is a phenomenon known as 'skid', where the reported instruction at a given sample is inaccurate because modern CPUs are <i>complicated.</i> See the <a href="http://www.spinics.net/lists/linux-perf-users/msg02157.html" target="_blank">following dialogue on the linux-perf-user mailing list</a>:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhG3A5Ce76pYOLcn2rm-gboblXa8wwWFQhZ7u79aJgPt_i_gAALq5OUtpBns41MZCMZ5yxJTZlvLVD1ZJA-FNYx8qRZlY_bOLFlWJZVFgmM03iSr3GAJtqAghj8HnlbkeP0r8ZDdeUOlRM/s1600/marvin-gaye.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhG3A5Ce76pYOLcn2rm-gboblXa8wwWFQhZ7u79aJgPt_i_gAALq5OUtpBns41MZCMZ5yxJTZlvLVD1ZJA-FNYx8qRZlY_bOLFlWJZVFgmM03iSr3GAJtqAghj8HnlbkeP0r8ZDdeUOlRM/s320/marvin-gaye.jpg" width="320" /></a></div>
<blockquote class="tr_bq">
<i>> I think Andi mentioned this to me last year -- that instruction profiling was no longer reliable.</i> </blockquote>
<blockquote class="tr_bq">
<b>It never was.</b> </blockquote>
<blockquote class="tr_bq">
<i>> Is this due to parallel and out-of-order execution? (ie, we're sampling the instruction pointer, but that's set to the resumption instruction, not the instructions being processed in the backend?).</i> </blockquote>
<blockquote class="tr_bq">
<b>Most problems are due to '<u>skid</u>': It takes some time to trigger the </b><b>profiling interrupt after the event fired. </b><b>[...] </b><b>There are also other problems, for example an event may not be tied </b><b>to an instruction. Some events have inherently large skid.</b></blockquote>
</div>
This sounds grim, but at the end of the day this is as accurate as profiling can get. It's not perfect, but it's still very valuable and you get used to it (there are ways to minimize skid discussed in the link, I've not tried those).<br />
What you end up doing is looking for instructions just before the blamed instruction, or instructions on which the blamed instruction is dependent, which may be more reasonably blamed for the bottleneck. Given the CMP is not the problem, we must ask why the CPU would spend so much time at it. A CMP or a TEST will often get blamed for the price of the load into the registers they use; in this case the CMP is most probably being blamed for the load of the queue type from the object header one instruction back:<br />
"<span style="font-family: "courier new" , "courier" , monospace;">0x8(%r12,%r11,8),%r8d ; implicit exception: dispatches to 0x00007f83c93f929d</span>"<br />
It doesn't help that the comment talks about some implicit exception (this is an implicit null check), where it could say "<span style="font-family: "courier new" , "courier" , monospace;">getkid(q) + implicit_nullchk(q)</span>" or something similar to indicate we are loading the kid (klass id) from the object header (see object header layout details <a href="http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/share/vm/oops/oop.hpp" target="_blank">here</a>).<br />
Now that I had pointed a finger at this operation, I was still quite confused. This was not an issue for any of the other queues, so why would loading the kid be such a bottleneck in this case? Maybe I'm wrong about this (always a good assumption)? To prove this was the issue I created a duplicate of the original benchmark where I use the SpscGrowableArrayQueue directly instead of going via the interface; for comparison I also benchmarked the SpscArrayQueue in the same fashion:<br />
<span style="font-family: "courier new" , "courier" , monospace;">SpscArrayQueue 369.544 ± 1.535 ops/us<br />
SpscGrowableArrayQueue 272.021 ± 12.133 ops/us</span><br />
<span style="font-family: inherit;">Now that's more like it! the expected 20% difference is more like 30%, but this is much closer. This still begs the question, why is the type check for </span>SpscGrowableArrayQueue so expensive? We can see that for SpscArrayQueue this makes very little difference, how is SpscGrowableArrayQueue different?<br />
<br />
<h3>
<a href="http://www.youtube.com/watch?v=DPYKfawYfVQ" target="_blank">Messing with the KID</a></h3>
I had a long hard look at this issue, which didn't help, then slept on it, which did help, and realized the problem here is that the object header is false-sharing with the producer fields. When I trimmed down the padding on this class in an effort to minimize allocation, I removed the array post and pre-padding as well as the class pre/post padding and reasoned that for the most part I need not worry about the object's neighbours false sharing. What I failed to realize was that the consumer and producer threads might be frequently hitting the object header, in this benchmark on every call. Once I realized this was the issue I reinstated the pre-padding such that the producer index is far enough from the object header to stop interfering with it and the problem went away (see <a href="https://github.com/JCTools/JCTools/blob/7f78a25f40740df12def944faa67b318e6f5e8f2/jctools-core/src/main/java/org/jctools/queues/SpscGrowableArrayQueue.java#L26" target="_blank">before</a> and <a href="https://github.com/JCTools/JCTools/blob/42fa9fab84026952fc31b06b69ba156b92c2d11b/jctools-core/src/main/java/org/jctools/queues/SpscGrowableArrayQueue.java#L29" target="_blank">after</a>, it's a messy diff as I fixed some other niggles while I was there, you can take the last version and play with adding/removing the padding and reordering the index and cold fields to verify my claims).<br />
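Schematically, the reinstated pre-padding looks something like the following (class and field names are illustrative, not the actual JCTools layout):<br />
<pre>
// Illustrative pre-padding via inheritance: the padding superclass keeps the
// hot producer fields at least a cache line or two away from the object
// header, so header reads (like the klass id check above) no longer
// false-share with producer writes.
abstract class PrePad {
    long p01, p02, p03, p04, p05, p06, p07, p08;
    long p11, p12, p13, p14, p15, p16, p17, p18;
}

abstract class ProducerFields<E> extends PrePad {
    protected long producerIndex;
    protected E[] producerBuffer;
}
// ... the consumer index and cold fields are separated from the producer
// fields by further padding classes in the same fashion.
</pre>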
Here's the (slightly trimmed for readability) perfasm output for the same conditional inlining check in the padded version:<br />
<script src="https://gist.github.com/nitsanw/26fdebafabc59d80ab18.js"></script><br />
As an interesting side note, this issue was not visible in the hand rolled throughput benchmarks. This is because in those benchmarks the queue is hoisted into a variable before the consume/produce loop which means the conditional inlining check can be done out of loop as well. This is great for getting nice numbers but hides an issue which users are likely to hit. Credit goes to JMH for herding benchmarks down a path which forces these issues into the measured scope.<br />
<br />
<h3>
Summary</h3>
<div>
The main focus of this post is perfasm; I hope it helps get you started on using it. The broader context in which you would use this tool has been left mostly implicit. I use perfasm regularly, but I'm also reasonably happy to look at assembly and compiler annotations, which I know most people are not. I find it to be invaluable in issues, like the one described in this post, where the cost is between the Java lines rather than in the lines themselves. Any profiler can give you a broad feel for where performance bottlenecks are, and a non-safepoint-biased profiler can show you the hottest line of code. What a Java profiler will not tell you about is the generated JVM code between your lines. It will attribute the cost of those instructions to the nearest Java line, and that can become a very confusing chase.<br />
It is also worth pointing out that any nano-benchmark (measuring operations in the 0-500 nanoseconds range) practically requires you to look at the assembly for its analysis. But 500ns can still be an awful lot of code, and an assembly level profiler is very handy. At this point it is worth mentioning Oracle Solaris Studio (first prize for least known profiler every year since release). It is a great assembly level profiler, and just generally a great profiler. If your measurement needs to take place outside the cosy comforts of a JMH benchmark I would recommend you give it a spin.<br />
Finally, this investigation came in the context of a workflow that is followed in the development of JCTools. I would loosely describe it as follows:
<br />
<ol>
<li>Implement new feature/data structure</li>
<li>Reason about expected impact/performance as part of design and implementation</li>
<li>Test expectations using the existing set of benchmarks</li>
<li>Expectations are far off the mark (if not... not sure, will figure it out when it happens)</li>
<li>Dig in until either expectations or code are cured.</li>
</ol>
This has a whiff of scientific exploration to it, but I assure you it is not done quite so seriously and I often fail to follow my own advice (or worse other people's advice). The habit of testing performance assumptions/expectation has offered me many an afternoon spent banging my head on a variety of surfaces. Perfasm has been instrumental in reducing the amount of head banging, but I fear nothing short of permanent brain damage will actually solve the problem.<br />
This post has been kindly reviewed by <a href="http://shipilev.net/" target="_blank">Aleksey Shipilev</a>, <a href="https://twitter.com/darachennis" target="_blank">Darach Ennis</a> and <a href="http://insightfullogic.com/" target="_blank">Richard Warburton</a>. Any remaining errors are entirely their fault ;-)</div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com10tag:blogger.com,1999:blog-5171098727364395242.post-87586895475273803712015-05-22T06:35:00.000+01:002015-05-22T06:35:08.554+01:00Object.equals, primitive '==', and Arrays.equals ain't equal<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhB-_FWpNiVJLZvZzNTd0S1DLTlExpe_fV6wMXAkVaBoVOg29Ab9Ss_91qP0I8R0IYnmtLH_59fCav-2VmkWxc1v6-lXHDrFUqWI4Lt7ZH9114n0GjP7c2RImKYCdNwSBGsK1NzNFfAFog/s1600/jabberwocky.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhB-_FWpNiVJLZvZzNTd0S1DLTlExpe_fV6wMXAkVaBoVOg29Ab9Ss_91qP0I8R0IYnmtLH_59fCav-2VmkWxc1v6-lXHDrFUqWI4Lt7ZH9114n0GjP7c2RImKYCdNwSBGsK1NzNFfAFog/s200/jabberwocky.jpg" width="134" /></a></div>
It is a fact well known to those who know it well that "==" != "equals()", the example usually going something like:<br />
<span style="font-family: Courier New, Courier, monospace;"> String a = "Tom";</span><br />
<span style="font-family: Courier New, Courier, monospace;"> String b = new String(a);</span><br />
<span style="font-family: Courier New, Courier, monospace;"> <b>-></b> a != b but a.equals(b)</span><br />
<br />
It also seems reasonable therefore that:<br />
<span style="font-family: Courier New, Courier, monospace;"> String[] arr1 = {a};</span><br />
<span style="font-family: Courier New, Courier, monospace;"> String[] arr2 = {b};</span><br />
<span style="font-family: Courier New, Courier, monospace;"> <b>-></b> arr1 != arr2 but Arrays.equals(arr1, arr2)</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">So far, so happy... That's examples for you...</span><br />
<span style="font-family: inherit;">For primitives we don't have the equals method, but we can try boxing:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> int i = 0;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> int j = 0;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> <b>-></b> i == j and also ((Integer)i).equals((Integer)j), and Arrays.equals({i}, {j})</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJlvYcSvwQClmpK-aCtoGYEfQ1YWUJLuUnzrT4flqnRbBQi0opjIhy0Y-cQRhuciLxOlFNkt0UDlULZHqjSneF5qlKUtc4WIoCX6jth-G0CL9J6-wEVDjDRjn_90iWrFaDvdgFado8_K4/s1600/laughter.jpeg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="168" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJlvYcSvwQClmpK-aCtoGYEfQ1YWUJLuUnzrT4flqnRbBQi0opjIhy0Y-cQRhuciLxOlFNkt0UDlULZHqjSneF5qlKUtc4WIoCX6jth-G0CL9J6-wEVDjDRjn_90iWrFaDvdgFado8_K4/s200/laughter.jpeg" width="200" /></a></div>
<h3>
<span style="font-family: inherit;">Floats ruin everything</span></h3>
<span style="font-family: inherit;">But... some primitives are more equal then others. Consider for instance the following example:</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> </span><span style="font-family: 'Courier New', Courier, monospace;"> </span><span style="font-family: Courier New, Courier, monospace;">float f1 = Float.Nan;</span><br />
<div>
<span style="font-family: Courier New, Courier, monospace;"> float f2 = Float.Nan;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> <b>-></b> f1 != f2, and also f1 != f1, Nan's are nasty</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;">This is what floats are like children, they be <a href="https://www.youtube.com/watch?v=O_aziIIp8U8" target="_blank">treacherous little hobbitses</a> and no mistake. The fun starts when you box them fuckers:</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> <b>-></b> </span><span style="font-family: 'Courier New', Courier, monospace;">((Float)f1).equals((</span><span style="font-family: 'Courier New', Courier, monospace;">Float</span><span style="font-family: 'Courier New', Courier, monospace;">)f2), because Java likes to fix shit up</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> <b>-></b> </span><span style="font-family: 'Courier New', Courier, monospace;">Arrays.equals({</span><span style="font-family: 'Courier New', Courier, monospace;">((Float)f1)},{((</span><span style="font-family: 'Courier New', Courier, monospace;">Float</span><span style="font-family: 'Courier New', Courier, monospace;">)f2)}), hmm consistent</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> <b>-></b> but also: </span><span style="font-family: 'Courier New', Courier, monospace;">Arrays.equals({</span><span style="font-family: 'Courier New', Courier, monospace;">f1},{</span><span style="font-family: 'Courier New', Courier, monospace;">f2})...</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;">This is counter to how one would normally think arrays are compared. You would think that for primitives (skipping arrays null and length checks):</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> boolean feq = true;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> for(int i=0;i<farr1.length;i++){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> if(</span><span style="font-family: 'Courier New', Courier, monospace;">farr1[i] != farr2[i]</span><span style="font-family: 'Courier New', Courier, monospace;">) {</span></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;"> feq = false; break;</span></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: inherit;">Is the same as:</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> boolean feq = Arrays.equals(farr1,farr2);</span></div>
<div>
<span style="font-family: inherit;">But for <i>double[]</i> and <i>float[]</i> the contract has been changed to accommodate <i>Nan</i>. This is perhaps a good decision on the JDK authors side, but it is somewhat surprising.</span></div>
<div>
<span style="font-family: inherit;"><u>Conclusion: 2 arrays can be equal, even if some elements are not equal.</u></span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<h3>
<span style="font-family: inherit;">Objects are weird<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwdyc6zQY04Eq14p-Pd2ihFTWlF9-mFXdX22lmvZ4UCWDdeK3TCOhVhk1KjFf432dOFCIpaFPnXn4YMDhT0IGNe4bW2S72fwGgAwjdHC7qyJrJXaFRyEzUZlZVigXNdbmdIgPIuJnqBLc/s1600/sad-panda.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="125" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwdyc6zQY04Eq14p-Pd2ihFTWlF9-mFXdX22lmvZ4UCWDdeK3TCOhVhk1KjFf432dOFCIpaFPnXn4YMDhT0IGNe4bW2S72fwGgAwjdHC7qyJrJXaFRyEzUZlZVigXNdbmdIgPIuJnqBLc/s200/sad-panda.jpg" width="200" /></a></div>
</span></h3>
<div>
<span style="font-family: inherit;">Let's consider objects for a second. We started with <b>a != b</b> does not predicate that <b>!a.equals(b)</b>, but what about <b>a == b</b>?</span></div>
<span style="font-family: inherit;">The Object.equals() javadoc offers the following wisdom:</span><br />
<span style="font-family: Courier New, Courier, monospace;">The <b>equals</b> method implements an equivalence relation on non-null object references:</span><br />
<br />
<ul>
<li><span style="font-family: Courier New, Courier, monospace;">It is <i><u>reflexive</u></i>: for any non-null reference value <b>x</b>, <b>x.equals(x)</b> should return <b>true</b>.</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">It is <i>symmetric</i>: for any non-null reference values <b>x</b> and <b>y</b>, <b>x.equals(y)</b> should return <b>true</b> if and only if <b>y.equals(x)</b> returns <b>true</b>.</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">It is <i><u>transitive</u></i>: for any non-null reference values <b>x</b>, <b>y</b>, and <b>z</b>, if <b>x.equals(y)</b> returns <b>true</b> and <b>y.equals(z)</b> returns <b>true</b>, then <b>x.equals(z)</b> should return <b>true</b>.</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">It is <i><u>consistent</u></i>: for any non-null reference values <b>x</b> and <b>y</b>, multiple invocations of <b>x.equals(y)</b> consistently return <b>true</b> or consistently return <b>false</b>, provided no information used in equals comparisons on the objects is modified.</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">For any non-null reference value <b>x</b>, <b>x.equals(null)</b> should return <b>false</b>.</span></li>
</ul>
You know what the most important word in the above lines is? SHOULD<br />
<div>
In the <a href="https://www.youtube.com/watch?v=Mm3ypbAbLJ8" target="_blank">wonderful land of Should</a>, all classes behave and if they bother implementing an equals method they follow the above rules (and also override the <i>hashcode</i> method while they are there).</div>
<div>
But what happens when an object is implemented perversely, like this:</div>
<div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace;"> public<span class="s1"> </span>class<span class="s1"> Null {</span></span></div>
<div class="p2">
<span style="font-family: Courier New, Courier, monospace;"><span class="s1"> </span>@Override</span></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;"> <span class="s2">public</span> <span class="s2">boolean</span> equals(Object <span class="s3">obj</span>) {</span></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;"> <span class="s2">return</span> <span class="s3">obj</span> == <span class="s2">null</span>;</span></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div class="p3">
<span style="font-family: inherit;"><br /></span></div>
<div class="p3">
<span style="font-family: inherit;">This has some nice implications:</span></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;"> Null n = new Null();</span></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;"> <b>-></b> n != null, but n.equals(null)</span></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;"> Object[] a = {n};</span></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;"> Object[] b = {null};</span></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;"> Object[] c = {n};</span></div>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> <b>-></b> Arrays.equals(a, b) is true, because n.equals(null)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> <b>-></b> Arrays.equals(b, a) is false, because null != n</span></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> <b>-></b> </span><span style="font-family: 'Courier New', Courier, monospace;">Arrays.equals(a, a) is true, because a == a</span></div>
</div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> <b>-></b> </span><span style="font-family: 'Courier New', Courier, monospace;">Arrays.equals(a, c) is false, because !n.equals(n)</span></div>
</div>
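<div>
To see the above with your own eyes, here's a minimal runnable sketch (it assumes the Null class above is on the classpath; the expected output is noted in the comments):</div>
<pre style="font-family: Courier New, Courier, monospace;">
import java.util.Arrays;

public class NullDemo {
    public static void main(String[] args) {
        Null n = new Null(); // the perverse class defined above
        Object[] a = {n};
        Object[] b = {null};
        Object[] c = {n};
        System.out.println(Arrays.equals(a, b)); // true:  n.equals(null) is true
        System.out.println(Arrays.equals(b, a)); // false: the null element is compared with == against n
        System.out.println(Arrays.equals(a, a)); // true:  same array reference short-circuits
        System.out.println(Arrays.equals(a, c)); // false: n.equals(n) is false
    }
}
</pre>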
<div>
<br /></div>
<div>
Or to quote The Dude: "<a href="https://www.youtube.com/watch?v=1jRhgNp-fNc" target="_blank">Fuck it man, let's go bowling</a>"</div>
<div>
<br /></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span></div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com1tag:blogger.com,1999:blog-5171098727364395242.post-35158308747418121162015-05-19T20:33:00.000+01:002015-05-19T20:33:40.720+01:00Degrees Of (Lock/Wait) Freedom<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-WjJTep1DS6AIFLSYEyJxlpi9mzr4ltCfq0yfzDY-0WQ2XMDExxiNd-87QA3bipJyENeA_qboarNvB7xauzYQheu3-MawDDyGeG9PkPeChqAgEneaKMWvUbiPf4m94j3nMHgR-Q9qPbE/s1600/Y+do+we+care.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-WjJTep1DS6AIFLSYEyJxlpi9mzr4ltCfq0yfzDY-0WQ2XMDExxiNd-87QA3bipJyENeA_qboarNvB7xauzYQheu3-MawDDyGeG9PkPeChqAgEneaKMWvUbiPf4m94j3nMHgR-Q9qPbE/s1600/Y+do+we+care.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Yo, Check the diagonal, three brothers gone...</td></tr>
</tbody></table>
I've been throwing around the terms lock-free and wait-free in the context of the queues I've been writing, perhaps too casually. The definition I was using was the one from D. Vyukov's <a href="http://www.1024cores.net/home/lock-free-algorithms/introduction" target="_blank">website</a> (direct quote below):<br />
<ul>
<li><span style="font-family: Courier New, Courier, monospace;"><u><b>Wait-freedom:</b></u> Wait-freedom means that each thread moves forward regardless of external factors like contention from other threads, other thread blocking. Each operations is executed in a bounded number of steps. </span></li>
<li><span style="font-family: Courier New, Courier, monospace;"><b><u>Lock-freedom:</u></b> Lock-freedom means that a system as a whole moves forward regardless of anything. Forward progress for each individual thread is not guaranteed (that is, individual threads can starve). It's a weaker guarantee than wait-freedom. [...] A blocked/interrupted/terminated thread can not prevent forward progress of other threads. Consequently, the system as a whole undoubtedly makes forward progress.</span></li>
<li><span style="font-family: Courier New, Courier, monospace;"><b><u>Obstruction-freedom:</u></b> Obstruction-freedom guarantee means that a thread makes forward progress only if it does not encounter contention from other threads. That is, two threads can prevent each other's progress and lead to a livelock. </span></li>
<li><span style="font-family: Courier New, Courier, monospace;"><b><u>Blocking Algorithms:</u></b> It's the weakest guarantee - basically all bets are off, the system as a whole may not make any forward progress. A blocked/interrupted/terminated thread may prevent system-wide forward progress infinitely. </span></li>
</ul>
The above definitions refer to "forward progress/moving forward" in the context of "system" and "thread". In particular they translate "X-freedom" to "a guarantee that T/S makes forward progress".<br />
Now "thread" is a well defined term, but what counts as forward progress of a given thread?<br />
To my mind a thread which is free to return from a method is free to 'make progress'. For example: if the next element in a queue is not visible to a consumer thread I can return control to the consumer thread which is then free to make progress on anything it feels like and try getting an element out of the queue later. Similarly a producer thread which is unable to add elements to a queue is 'free' to 'make progress' even if the queue is full. In short, I interpreted 'freedom' as regaining control of execution.<br />
<br />
<h3>
<a href="https://youtu.be/Xk2uObQDKtw?t=277" target="_blank">Freedom</a>? yeah, right...</h3>
My interpretation, however, is limited to the scope of the thread and assumes it has other things to do (so placing the thread within the context of a given 'system'). So the freedom to make progress is really a freedom to make progress on 'other things'. The definitions above, when applied to a given data structure, have no knowledge of the 'system' and it is therefore fair to assume nothing. And so if the 'system' is viewed as being concerned only with using the data structure, it seems my view of 'regained control freedom' is not in line with the 'progress making freedom'.<br />
Let's consider for example the linked Multi-Producer-Single-Consumer queue discussed in the <a href="http://psy-lob-saw.blogspot.com/2015/04/porting-dvyukov-mpsc.html" target="_blank">previous post</a> (this is not the j.u.Queue compliant version):<br />
<script src="https://gist.github.com/nitsanw/365ab84d4b8712ec88f1.js"></script><br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3E-Sb8WOwiDyV5qx_ASMhSViQppwadZB_cxDPUYsxVw0tF6Oi9DkIeJ4NvVanzrMMMhAutgkGkpwAkujY9xJdDCYHTDJCM-xBqzgYzu0i5pieo3QGK9rEqrAx9EQnIA-Sdf-ce-8-i4E/s1600/220px-RageAgainsttheMachineRageAgainsttheMachine.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3E-Sb8WOwiDyV5qx_ASMhSViQppwadZB_cxDPUYsxVw0tF6Oi9DkIeJ4NvVanzrMMMhAutgkGkpwAkujY9xJdDCYHTDJCM-xBqzgYzu0i5pieo3QGK9rEqrAx9EQnIA-Sdf-ce-8-i4E/s1600/220px-RageAgainsttheMachineRageAgainsttheMachine.jpg" style="cursor: move;" /></a>Now let us assume a producer has stalled in that unpleasant gap between setting the <i>head</i> and pointing <i>prev</i> to the new head(at line 34), this is what I've come to think of as a 'bubble' in the queue. The next producer will see the new head, but the consumer will not see it (or any nodes after it) until <i>prev.next</i> is set.<br />
Other producers can 'make progress' in the sense that elements will get added to the queue. Those elements' visibility to the consumer, however, is blocked by the 'bubble'. What is the correct degree of freedom for these producers? What about the consumer?<br />
When <a href="http://www.1024cores.net/home/lock-free-algorithms/queues/non-intrusive-mpsc-node-based-queue" target="_blank">describing the queue</a> Vyukov makes the following statements:<br />
<ul>
<li><span style="font-family: Courier New, Courier, monospace;">Wait-free and fast producers. One XCHG is maximum what one can get with multi-producer non-distributed queue.</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">Push [offer in the Java version] function is <b>blocking</b> wrt consumer. I.e. if producer blocked in (* [line 34]), then consumer is blocked too. Fortunately 'window of inconsistency' is extremely small - producer must be blocked exactly in (* [line 34]).</span></li>
</ul>
So producers are wait-free, but consumer is blocking. The consumer is truly blocked by the 'bubble' and can make no progress in the terms of the data structure. Despite other producers adding nodes the consumer cannot see those nodes and is prevented from consuming the data. The fact that control is returned to the caller is considered irrelevant. And if we are being precise we should call this a blocking queue.<br />
<br />
<h3>
Life imitates Art</h3>
<div>
The "<a href="http://www.amazon.com/The-Multiprocessor-Programming-Revised-Reprint/dp/0123973376" target="_blank">Art of Multiprocessor Programming</a>" offers a different definition:</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">"A concurrent object implementation is wait free if each method call completes in a finite number of steps. A method is lock-free if it guarantees that infinitely often some method call finishes in a finite number of steps" - page 99</span></blockquote>
This sentiment is similarly echoed in <a href="https://offblast.org/stuff/books/lockfreequeues_ppopp11.pdf" target="_blank">"Wait-Free Queues With Multiple Enqueuers and Dequeuers"</a> a paper on lock free queues:<br />
<br />
<ul>
<li>[...] to ensure that a process (or a thread) completes its operations in a bounded number of steps, regardless of what other processes (or threads) are doing. This property is known in the literature as (bounded) <b>wait-freedom</b>.</li>
<li><b>lock-freedom</b> ensures that among all processes accessing a queue, at least one will succeed to finish its operation.</li>
</ul>
<br />
Hmmm... no system, no progress. The boundaries are now an object and a method, which are well defined. I think this definition matches my understanding of 'regained control freedom' (open to feedback). If a method returns, progress or no progress, the condition is fulfilled. Under this definition the queue above is wait-free.<br />
The <a href="http://en.wikipedia.org/wiki/Non-blocking_algorithm" target="_blank">wiki definition</a> is again closer to the one outlined by Vyukov:<br />
<div>
<ul>
<li><span style="font-family: Courier New, Courier, monospace;">An algorithm is called non-blocking if <u>failure or suspension of any thread cannot cause failure or suspension of another thread for some operations</u>. A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress.</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">Wait-freedom is the strongest non-blocking guarantee of progress, combining guaranteed system-wide throughput with starvation-freedom. An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes.</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">Lock-freedom allows individual threads to starve but guarantees system-wide throughput. An algorithm is lock-free if it satisfies that when the program threads are run sufficiently long at least one of the threads makes progress (for some sensible definition of progress). All wait-free algorithms are lock-free.</span></li>
</ul>
</div>
We are back to a fuzzy definition of system and progress, but the underlined sentence above is interesting in highlighting suspension. In particular I'm not sure how this is to be interpreted in the context of the lock-free definition, where suspension is not an issue if some threads can keep going.<br />
Note that once we have fixed poll to behave in a j.u.Queue compliant way:
<script src="https://gist.github.com/nitsanw/090a6682bf9c7202f1ea.js"></script>
The poll method is no longer wait-free by any definition.<br />
<div>
<br /></div>
<h3>
What about the JCTools queues?</h3>
<div>
<a href="http://www.github.com/JCTools/JCtools" target="_blank">JCTools</a> covers the full range of MPMC/MPSC/SPMC/SPSC range of queues, and aims for high performance rather than strict adherence to any of the definitions above. To be more precise in the definitions I would say:<br />
<ul>
<li>SpscArrayQueue/SpscLinkedQueue are Wait Free (in both the control and progress senses)</li>
<li>MpscArrayQueue is lock free on the producer side and blocking on the consumer side. I'm planning to add a weakPoll method which will be wait free in the control sense.</li>
<li>MpscLinkedQueue is wait free on the producer side and blocking on the consumer side. I'm planning to add a weakPoll method which will be wait free in the control sense.</li>
<li>SpmcArrayQueue is lock free on the consumer side and blocking on the producer side. I'm planning to add a weakOffer method which will be wait free in the control sense.</li>
<li>MpmcArrayQueue is blocking on both the producer and consumer side. I'm planning to add a weakOffer/Poll method which will be lock free in the control sense.</li>
</ul>
Does it matter that the queues do not meet the exact definitions set out above? As always it depends on your needs...<br />
<ul>
</ul>
</div>
<div>
<h3>
Non-Blocking Algorithms On The JVM?</h3>
</div>
<div>
<span style="font-size: x-small;">{<b><u>Full disclosure</u></b>: please note when reading the next section that I work with <a href="http://www.azulsystems.com/" target="_blank">Azul Systems</a> on the <a href="http://www.azulsystems.com/products/zing/whatisit" target="_blank">Zing JVM</a>, I'm not trying to sell anything, but I am naturally more familiar with it and consider it awesome :-)}</span></div>
<div>
What happens when we include the JVM in our definition of a system? Can you build a non-blocking algorithm on the JVM? All the JVMs I know of are blocking in some particular cases; the ones important to the discussion above are:</div>
<div>
<ol>
<li><b>Allocation</b>: The common case for allocation is fast and involves no locking, but once the young generation is exhausted a collection is required. Young generation collections are stop the world events for Oracle/OpenJDK. Zing has a concurrent young generation collector, but under extreme conditions or bad configuration it may degrade to blocking the allocator. <b>The bottom line is that allocation on the JVM is blocking and that means you cannot consider an algorithm which allocates non-blocking.</b> To be non-blocking a system would have to provably remove the risk of a blocking collection event on allocation.</li>
<li><b>Deoptimization</b>: Imagine the JIT compiler in its eager and aggressive compilation strategy has decided your <i>offer</i> method ends up getting compiled a particular way which ends up not working out (assumptions on inheritance, class loading, passed in values play a part). When the assumption breaks a deoptimization may take place, and that is a blocking event. It is hard to prove any piece of code is deoptimization risk free, and therefore it is hard to prove any Java code is non-blocking. Deoptimization is in many cases a warmup issue. Zing is actively battling this issue with <a href="http://www.azulsystems.com/solutions/zing/readynow" target="_blank">ReadyNow</a> which reloads previously recorded compilation profiles for the JVM, greatly reducing the risk of deoptimization. Oracle/OpenJDK users can do a warmup run of their application before actually using it to reduce the risk of deoptimization. My colleague <a href="https://twitter.com/dougqh" target="_blank">Doug Hawkins</a> gave a <a href="https://vimeo.com/120533011" target="_blank">great talk on this topic</a> which will give you far more detail than I want to go into here :-). </li>
<li><b>Safepoints</b>: If your algorithm includes a <a href="http://psy-lob-saw.blogspot.com/2014/03/where-is-my-safepoint.html" target="_blank">safepoint poll</a> it can be brought to a halt by the JVM. The JVM may bring all threads to a safepoint for any number of reasons. Your method, unless inlined, already has a safepoint on method exit (OpenJDK) or entry (Zing). Any non-counted loop will have a safepoint-poll, and this includes your typical CAS loop (a sketch of such a loop follows this list). Can you prevent safepoints from ever happening in your system? It might be possible for a JVM to provide users with compiler directives which will prevent safepoint polls from being inserted in certain code paths, but I don't know of any JVM offering this feature at the moment.</li>
</ol>
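<div>
To make that last point concrete, here is what a typical CAS retry loop looks like (a sketch only; whether and where a poll instruction is emitted is ultimately up to the particular JVM and JIT compiler):</div>
<pre style="font-family: Courier New, Courier, monospace;">
import java.util.concurrent.atomic.AtomicLong;

class CasRetryLoop {
    private final AtomicLong counter = new AtomicLong();

    // A classic CAS retry loop. The loop is not a counted loop, so the JIT will
    // typically place a safepoint poll on its back-edge, i.e. the JVM is free to
    // suspend this thread between retries. This is part of why it is hard to call
    // any Java code non-blocking once the JVM is included in the 'system'.
    long getAndAdd(long x) {
        long current;
        do {
            current = counter.get();
        } while (!counter.compareAndSet(current, current + x));
        return current;
    }
}
</pre>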
In short, if you are on the JVM you are probably already blocking. Any discussion of non-blocking algorithms on the JVM is probably ignoring the JVM as part of the 'system'. This needs to be considered if you are going to go all religious about wait-free vs. lock-free vs. blocking. If a system MUST be non-blocking, it should probably not be on the JVM.</div>
<div>
<br /></div>
<h3>
Summary</h3>
<div>
Any term is only as good as our ability to communicate with others using it. In that sense, splitting hairs about semantics is not very useful in my opinion. It is important to realize people may mean any number of things when talking about lock-free/wait-free algorithms and it is probably a good idea to check if you are all on the same page.</div>
<div>
I personally find the distinction between control/progress freedom useful in thinking about algorithms, and I find most people mean lock/wait-free excluding the effects of the JVM...</div>
<div>
Thanks <a href="https://twitter.com/mjpt777" target="_blank">Martin</a> & <a href="https://twitter.com/dougqh" target="_blank">Doug</a> for the review, feedback and discussion, any remaining errors are my own (but their fault ;-) )</div>
<div>
<br /></div>
<blockquote class="tr_bq">
</blockquote>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com9tag:blogger.com,1999:blog-5171098727364395242.post-22932891605425256752015-04-22T10:51:00.002+01:002015-04-24T10:07:07.627+01:00Porting Pitfalls: Turning D.Vyukov MPSC Wait-free queue into a j.u.Queue<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLo0JThP3uP2-JzvWFtUzD3qexXim1xeN9IXqrVE1Q3SuDJfVOEXdhjBXuFiWlbgJGkp2y8Vxwe2cLw4NoZWK-EOonu2sIGakfgqEHC5R2uorsD6DtdERwxnXdhAqsZlNLKkBHDHlPMVM/s1600/domestic-sluttery-sluttishly-silly-ninja-russian-dolls.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLo0JThP3uP2-JzvWFtUzD3qexXim1xeN9IXqrVE1Q3SuDJfVOEXdhjBXuFiWlbgJGkp2y8Vxwe2cLw4NoZWK-EOonu2sIGakfgqEHC5R2uorsD6DtdERwxnXdhAqsZlNLKkBHDHlPMVM/s1600/domestic-sluttery-sluttishly-silly-ninja-russian-dolls.png" height="320" width="320" /></a></div>
<b><span style="font-size: x-small;">{This post is part of a long running series on lock free queues, <a href="http://psy-lob-saw.blogspot.com/p/lock-free-queues.html" target="_blank">checkout the full index to get more context here</a>}</span></b><br />
<a href="https://twitter.com/dvyukov" target="_blank">D. Vyukov</a> is an awesome lock-free dude, and I often refer to his instructive and invaluable
website <a href="http://www.1024cores.net/" target="_blank">1024cores.net</a> in my posts. On his site he covers lock free queue
implementations and in particular a wait-free <a href="http://www.1024cores.net/home/lock-free-algorithms/queues/non-intrusive-mpsc-node-based-queue" target="_blank">MPSC linked node queue</a>. This is really rather
special when you consider that normally MP would imply lock-free rather than wait-free
guarantees. I've ported his algorithm to Java (and so have many others: Netty/Akka/RxJava etc.), and had to tweak
it to match the Queue interface. In this post I'd like to explain the algorithm, its
translation to Java, and the implications of making it a j.u.Queue.
<br />
<br />
<h3>
Lock free vs. Wait free</h3>
Let's review the <a href="http://www.1024cores.net/home/lock-free-algorithms/introduction" target="_blank">definitions</a>:<br />
<ul>
<li>Wait-free: thread progress is guaranteed, all operations finish in a determined number of steps.</li>
<li>Lock-free: global progress is guaranteed, though a particular thread may be blocked.</li>
</ul>
<br />
An example of a transition from lock free to wait free
is available with JDK8 changes to <span style="font-family: Courier New, Courier, monospace;">AtomicReference::getAndSet()</span>. The change was made by utilizing
the newly available <span style="font-family: Courier New, Courier, monospace;">Unsafe::getAndSetObject</span> intrinsic which translates directly to <span style="font-family: Courier New, Courier, monospace;">XCHG</span> (on x86). So
where we used to have for AtomicReference:<br />
<span style="font-family: Courier New, Courier, monospace;">T getAndSet(T newVal) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> T currentValue;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> do {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> currentValue = val; // val is a volatile field</span><br />
<span style="font-family: Courier New, Courier, monospace;"> } while (!Unsafe.compareAndSwapObject(this, VAL_FIELD_OFFSET, currentValue, newValue));</span><br />
<span style="font-family: Courier New, Courier, monospace;"> return currentValue;</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
Now we have:<br />
<span style="font-family: Courier New, Courier, monospace;">T getAndSet(</span><span style="font-family: 'Courier New', Courier, monospace;">T newVal</span><span style="font-family: 'Courier New', Courier, monospace;">) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> return Unsafe.getAndSetObject(this, VAL_FIELD_OFFSET, newValue);</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
I discussed a similar change to AtomicLong.getAndAdd in <a href="http://psy-lob-saw.blogspot.com/2014/06/jdk8-update-on-scalable-counters.html" target="_blank">a previous post</a>, replacing the CAS loop with
<span style="font-family: Courier New, Courier, monospace;">LOCK XADD</span>.<br />
<br />
<h3>
The Vyukov Wait Free MPSC queue</h3>
This is a LinkedList type structure and the interesting methods are offer and poll, here's the
original (I did some formatting):<br />
<script src="https://gist.github.com/nitsanw/b7b3a61d4fc022c9ee69.js"></script>
Awesome in its simplicity, deceptive in its genius. Be aware that the head/tail meaning is the other way around from what most people are used to. I personally go for producer/consumerNode (head = producer side, tail = consumer side in the above snippet) in my code, but for consistency I'll stick with Mr. V's notation for the porting exercise.<br />
But how do we manage the same memory barriers
in Java? We can be nice about it and use the AtomicFieldUpdater or more brutal and use Unsafe. I
find you get better performance with Unsafe, but you should consider how appropriate this is for you.
In any case, here's what we end up with:<br />
<script src="https://gist.github.com/nitsanw/365ab84d4b8712ec88f1.js"></script>
The code is pretty similar. Now if we wanted to complete the j.u.Queue we could extend AbstractQueue and
implement only size()/peek() and we're done (I'm not going to bother with the iterator()):<br />
<script src="https://gist.github.com/nitsanw/51cd025838c494cd0c11.js"></script>
There we go, seems reasonable don't it? off to the pub to celebrate some kick ass lock free programming!<br />
<br />
<h3>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijjhezRHeMM5DoFw9cPN7TrGSZ3nQCA_bZPuSelbjjATpazorpvuQiFa0hezlj9HiYCk9BSxEmmD9X-jYPAMb6zLvNXjf6H4hfjDkLdSqXshw5q68nKtr2OTIH7aFvNAsN0vaZa1HWyNk/s1600/its+ok+im+a+ninja.jpeg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijjhezRHeMM5DoFw9cPN7TrGSZ3nQCA_bZPuSelbjjATpazorpvuQiFa0hezlj9HiYCk9BSxEmmD9X-jYPAMb6zLvNXjf6H4hfjDkLdSqXshw5q68nKtr2OTIH7aFvNAsN0vaZa1HWyNk/s1600/its+ok+im+a+ninja.jpeg" height="200" width="200" /></a>
Vyukov::poll() ain't Queue::poll!</h3>
I've discussed this particular annoyance in a <a href="http://psy-lob-saw.blogspot.com/2014/07/poll-me-maybe.html" target="_blank">previous post which has some overlap with this one</a>. The names are the same, but as it turns out the guarantees are quite different. While Vyukov is doing a
fine job implementing a queue, not every queue is a j.u.Queue. In particular for this case, the poll()
method has a different contract:<br />
<ul>
<li>j.u.Queue: Retrieves and removes the head of this queue, or returns null if this queue is empty.</li>
<li>Vyukov: Retrieves and removes the head of this queue, or returns null if the next element is not available.</li>
</ul>
Why wouldn't the next element be available? Doesn't that mean the queue is empty? Sadly this ain't the case. Imagine for instance there are 2 producers and we run both threads step by step in a debugger. We break at line 33 (4th line of offer, the Java version):<br />
<ol>
<li><i>Producer 1</i>: we step 1 line, we just replaced head with node n. We are suspended before executing line 35.</li>
<li><i>Producer 2</i>: we let the program continue. We've replaced head and linked it to the previous head.</li>
</ol>
What state are we in? Let's look at head/tail nodes and where they lead (I number the nodes in order of assignment to head):<br />
<ul>
<li><i>head = Node[2]</i>, this is the node created by <i>Producer 2</i>. We also know that <i>Node[1].next = Node[2]</i>, because we let offer run its course on <i>Producer 2</i>.</li>
<li><i>tail = Node[0]</i>, the node we allocated in the constructor. This is the node head was before the first producer came along. This is what <i>prev</i> is equal to for <i>Producer 1</i>, but because we suspended that thread it never set its <i>next</i> value. <i>Node[0].next</i> is still null!</li>
</ul>
<br />
If a consumer came along now they would get a null from poll(), indicating the queue is empty. But the queue is obviously not empty!<br />
So it seems we cannot deduce the queue is empty from looking at <i>tail.next</i>. Here are 2 valid indicators that the queue is empty:<br />
<ul>
<li><i>head == tail</i> : this is the starting point set in the constructor and where the consumer ends up after consuming the last element</li>
<li><i>head.val == null</i> : head can only have a value of null if it is tail</li>
</ul>
Here's a solution to a correct poll() and the knock on effect on peek():<br />
<script src="https://gist.github.com/nitsanw/a9b0b061bc212551383f.js"></script>
This is a bit annoying because now we have a wait-free offer(), but poll() and peek() are lock-free (they only block one thread; producers can make progress).<br />
This pitfall is tricky enough that not only did <a href="https://github.com/JCTools/JCTools/issues/21" target="_blank">I fall for it </a>(on another queue algorithm, but same mistake), it also took by surprise Mr. Manes writing an interesting variant on this queue for his <a href="https://github.com/ben-manes/caffeine" target="_blank">high performance cache implementation</a> (I filed an <a href="https://github.com/ben-manes/caffeine/issues/9" target="_blank">issue which he promptly fixed</a>), and struck the notorious <a href="https://blogs.oracle.com/dave/entry/ptlqueue_a_scalable_bounded_capacity" target="_blank">Dave Dice when considering Vyukov's MPMC queue</a> (see comments for discussion).<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-8FT_pJkD5GDGIOmM8S_bP-vo7kxgTkPDvHjilUrZui_dwwflfoA3HBT3o4EnBnqNEDdbXWjdQIiaTDCUlwB2pfb2bo66xHwkOM3uPmRMuLUD0AhRRJgn-YAo-n4Q6jz9nEUxoyiKpfc/s1600/fat+ninja.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-8FT_pJkD5GDGIOmM8S_bP-vo7kxgTkPDvHjilUrZui_dwwflfoA3HBT3o4EnBnqNEDdbXWjdQIiaTDCUlwB2pfb2bo66xHwkOM3uPmRMuLUD0AhRRJgn-YAo-n4Q6jz9nEUxoyiKpfc/s1600/fat+ninja.png" height="224" width="320" /></a>So is this it? Are we done?<br />
Almost... <i>size()</i> is still broken.<br />
<br />
<h3>
A Good Size</h3>
<div>
It's not really surprising that size is broken given the terminating condition for the size loop was relying on next == null to terminate the count. Size was also broken in 2 other subtle ways:</div>
<div>
<ul>
<li>The interface for size dictates that it returns a positive int. But given that the queue is unbounded it is possible (though very unlikely) for it to have more than 2^31 elements. This would require that the linked list consume over 64GB (16b + 8b + 8b = 32b per element; refs are 8b since this requires more than a 32GB heap, so no compressed oops). Unlikely, but not impossible. This edge condition is handled for ConcurrentLinkedQueue (same as here), and for LinkedBlockingQueue (by bounding its size to MAX_INT, so it's not really unbounded after all) but not for LinkedList (size is maintained in an int and is blindly incremented).</li>
<li>Size is chasing a moving target and as such can in theory never terminate as producers keep adding nodes. We should limit the scope of the measurement to the size of the queue at the time of the call. This is not done for CLQ and should perhaps be considered.</li>
</ul>
Here's the end result:</div>
<script src="https://gist.github.com/nitsanw/a12330da236e8ba7c931.js"></script>
If you need a full SPSC/MPSC linked queue implementation they are available in <a href="https://github.com/JCTools/JCTools" target="_blank">JCTools</a> for your pleasure, enjoy!<br />
<br />
Special thanks to <a href="http://twitter.com/switchology" target="_blank">Doug</a> and <a href="https://github.com/ben-manes" target="_blank">Ben</a> for reviewing!<br />
<div>
<br /></div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com8tag:blogger.com,1999:blog-5171098727364395242.post-29222486760951464592015-04-13T10:45:00.002+01:002015-04-14T11:55:14.595+01:00On Arrays.fill, Intrinsics, SuperWord and SIMD instructions<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSdxeVhJ_zUjzvpqXb-sxXzwFZn8IhIKqblwqxa7Fm1nGl3ZUG914g64C0CwPv_94YVMje-zeAQ_yzHnkkSD2uLO7iyX3vuiwweAFBWcszJDY8KeWfjo9sHmD-ZD3TdPaW8mbAgxj8oX0/s1600/empire-strikes-back-chewbacca.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSdxeVhJ_zUjzvpqXb-sxXzwFZn8IhIKqblwqxa7Fm1nGl3ZUG914g64C0CwPv_94YVMje-zeAQ_yzHnkkSD2uLO7iyX3vuiwweAFBWcszJDY8KeWfjo9sHmD-ZD3TdPaW8mbAgxj8oX0/s1600/empire-strikes-back-chewbacca.jpg" height="197" width="320" /></a></div>
<b>{This post turned rather long, if you get lazy feel free to skip to the summary}</b><br />
Let's <a href="https://www.youtube.com/watch?v=1RW3nDRmu6k" target="_blank">start at the very beginning, a very good place to start</a>... My very first post on this blog was <a href="http://psy-lob-saw.blogspot.co.il/2012/10/java-intrinsics-are-not-jni-calls.html" target="_blank">a short rant on intrinsics</a>, and how they ain't what they seem. In that post I made the following statement:<br />
<blockquote class="tr_bq">
<span style="font-family: inherit;">"<span style="background-color: white; color: #222222; line-height: 18.4799995422363px;">intrinsic functions show up as normal methods or native methods</span>"</span></blockquote>
Which is correct. An intrinsic function is applied as a method substitution. A method call will appear in the code and the compiler will replace its apparent source-level implementation with a pre-cooked implementation. In some cases intrinsics are sort of compilation cheats, the idea being that some bits of functionality are both very important (i.e. worthwhile optimizing) and can benefit from a hand-crafted solution that will be better than what the compiler can achieve. The end result can be in one of a few flavours:<br />
<ol>
<li>Method call replaced with a call to a JVM runtime method: E.g. System.arrayCopy is replaced with a call to a method stub generated by the runtime for all array types. This method call is not a JNI call, but it is a static method call that is not inlined.</li>
<li>Method call replaced with one or more instructions inlined: E.g. Unsafe.getByte/compareAndSet/Math.max</li>
<li>Method call replaced with compiler IR implementation: E.g. java.lang.reflect.Array.getLength</li>
<li>A mix of the above: E.g. String.equals is partially implemented in IR, but the array comparison is a call to a method stub.</li>
</ol>
The intrinsics are all set up in <a href="http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/share/vm/classfile/vmSymbols.hpp">vmSymbols.hpp</a> and if you look, you'll see Arrays.fill is NOT on the list. <a href="https://www.youtube.com/watch?v=clKi92j6eLE" target="_blank">So why am I talking about Chewbacca?</a> Because it is something like an intrinsic...<br />
<div>
<br /></div>
<div>
<h3>
The Arrays.fill SIMD Opportunity</h3>
<div>
<a href="https://docs.oracle.com/javase/8/docs/api/java/util/Arrays.html#fill-byte:A-byte-" target="_blank">Arrays.fill</a> is the Java <a href="http://www.cplusplus.com/reference/cstring/memset/" target="_blank">memset</a> (fills an array with a given value), and just like System.arrayCopy (memcpy in C lingo) is worth the effort to optimize and offers the same kind of opportunity. What opportunity might that be, you ask? the opportunity to use SIMD (Single Instruction Multiple Data) instructions when the underlying CPU offers them (I assume for the sake of discussion <a href="http://en.wikipedia.org/wiki/Advanced_Vector_Extensions" target="_blank">AVX enabled CPUs i.e. since Sandy Bridge</a>, I find this <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/" target="_blank">listing of intel intrinsics</a> useful to explain and sort through the available instructions). These instructions allow the CPU to operate on up to 256 bit (<a href="http://en.wikipedia.org/wiki/AVX-512" target="_blank">512 bit soon</a>) chunks of data, thus transforming 32 byte sized MOV instructions into a single wide MOV instruction (E.g. the intel C instrinsic <span style="background-color: white; font-size: 16px;"><span style="font-family: Courier New, Courier, monospace;">_mm256_storeu_si256</span></span> <span style="font-family: inherit;">or the corresponding instruction </span><span style="background-color: white;"><span style="font-family: Courier New, Courier, monospace;">vmovdqu</span></span>). SIMD instructions are good for all sorts of operations on vectors of data, or arrays, which is why the process of transforming element by element operations into SIMD instructions is also referred to as vectorization.<br />
The actual assembly stub is generated dependent on CPU and available instruction set. For x86 the code is generated by the <a href="http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/61746b5f0ed3/src/cpu/x86/vm/macroAssembler_x86.cpp#l6085" target="_blank">macroAssembler_x86.cpp</a>, and the observant digger into the code will find it makes use of the widest memory instructions it can identify the processor is capable of. Wider is better baby! If you are not morbidly curious about what the implementation looks like, skip the next wall of assembly and you'll be back in Java land shortly.<br />
Here's what the assembly boils down to when UseAVX>=2/UseSSE>=2/UseUnalignedLoadStores=true:</div>
</div>
<div>
<script src="https://gist.github.com/nitsanw/413f64350e73cb9bda17.js"></script><br />
Roughly speaking the algorithm above is:<br />
<ol>
<li>Fill up an XMM register with the intended value</li>
<li>Use the XMM register to write 64 byte chunks (2 <span style="background-color: white; font-family: 'Courier New', Courier, monospace;">vmovdqu</span>) until no more are available</li>
<li>Write leftover 32 byte chunk (skipped if no matching leftovers)</li>
<li>Write leftover 8 byte chunks (skipped if no matching leftovers)</li>
<li>Write leftover 4 bytes (skipped if no matching leftovers)</li>
<li>Write leftover 2 bytes (skipped if no matching leftovers)</li>
<li>Write leftover 1 bytes (skipped if no matching leftovers)</li>
</ol>
It ain't nice, but we do what we gotta for performance! There are variations of the above described across the internets as the done thing for a memset implementation; this might seem complex, but is pretty standard... anyway, moving right along.<br />
<br />
<h3>
The Arrays.fill 'intrinsic'</h3>
<div>
Arrays.fill is different from System.arrayCopy because, as its absence from vmSymbols suggests, it's not a method substitution kind of intrinsic (so technically <b>not</b> an intrinsic). What is it then? Arrays.fill is a code pattern substitution kind of compiler shortcut, basically looking for this kind of loop:</div>
<div>
<script src="https://gist.github.com/nitsanw/298c5daa0e3650f3fd4b.js"></script>
</div>
And replacing it with a call into the JVM memset implementation (I recently learnt the <a href="https://github.com/RichardWarburton/honest-profiler/issues/77" target="_blank">same thing is done by GCC as well</a>, see code to assembly <a href="http://gcc.godbolt.org/#%7B%22version%22%3A3%2C%22filterAsm%22%3A%7B%22labels%22%3Atrue%2C%22directives%22%3Atrue%2C%22commentOnly%22%3Atrue%2C%22intel%22%3Atrue%2C%22colouriseAsm%22%3Atrue%7D%2C%22compilers%22%3A%5B%7B%22sourcez%22%3A%22FAYglgdgxgNgrgEwKYB4oGcAuDkDMB8wwAbgPZgIAE6AhrkgPoBOS6SmAFGRZQFRY0mmADTUwAL0aYxkgJSUA3sEorKUABaC%2BAIxptKAXkotImJEwAOLTAyh7MaTUz74OAobIDcy1Rq28kCCojXX0Aahkkb1VKXFJnDj9nXgtDSlCoylSUSkCETyywsPklGJiUtIAGaNUAX2BaoA%22%2C%22compiler%22%3A%22%2Fopt%2Fgcc-4.9.0%2Fbin%2Fg%2B%2B%22%2C%22options%22%3A%22-O3%20-std%3Dc%2B%2B0x%22%7D%5D%7D" target="_blank">here</a>). The pattern matching bit is done in <a href="http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/55fb97c4c58d/src/share/vm/opto/loopTransform.cpp#l2629" target="_blank">loopTransform.cpp</a>. This feels enough like an intrinsic grey area that the method doing the pattern match and replace is called <span style="font-family: Courier New, Courier, monospace;">intrinsify_fill.</span><br />
Pattern matching makes this optimization potentially far more powerful than method substitution as the programmer doesn't have to use a special JDK method to convey meaning, they can just express their meaning in code and the compiler 'knows' that this simple loop means 'fill'. Compare that with System.arrayCopy where rolling your own leads to performance that is much worse than that offered by the intrinsic.<br />
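<div>
For reference, the loop shape being matched is essentially the following (a sketch only; the gist above shows the benchmark version and loopTransform.cpp defines the exact pattern):</div>
<pre style="font-family: Courier New, Courier, monospace;">
class FillPattern {
    // the shape intrinsify_fill looks for: a simple forward loop storing a
    // loop-invariant value into consecutive elements of an array
    static void manualFill(byte[] array, byte value) {
        for (int i = 0; i < array.length; i++) {
            array[i] = value;
        }
    }
}
</pre>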
Let's prove me right (my favourite thing, beats <a href="https://youtu.be/0IagRZBvLtw?t=15" target="_blank">kittens and all that crap</a>), here's a JMH (see the <a href="http://psy-lob-saw.blogspot.com/p/jmh-related-posts.html" target="_blank">JMH reference page</a> for more JMH info/examples) benchmark comparing Arrays.fill to a hand rolled fill, and System.arrayCopy to handrolled array copy:
<script src="https://gist.github.com/nitsanw/5f1b538d80f6be2c5546.js"></script>
<br />
And the results are (Oracle JDK8u40/i7-4770@3.40GHz/Ubuntu, array is 32K in size)?<br />
<span style="font-family: Courier New, Courier, monospace;">
ArrayFill.fillBytes 561.540 ± 10.814 ns/op<br />
ArrayFill.manualFillBytes 557.901 ± 5.255 ns/op<br />
ArrayFill.manualReversedFillBytes 1017.856 ± 0.425 ns/op<br />
ArrayFill.copyBytes 1300.313 ± 13.482 ns/op<br />
ArrayFill.manualCopyBytes 1477.442 ± 13.030 ns/op</span></div>
<br />
We can verify that the call out to the JVM fill method happens for fillBytes/manualFillBytes by <a href="http://psy-lob-saw.blogspot.com/2013/01/java-print-assembly.html" target="_blank">printing out the assembly</a>:<br />
<br />
<script src="https://gist.github.com/nitsanw/099b1de013185aba8863.js"></script>
So what have we learnt so far:<br />
<ul>
<li>Use System.arrayCopy, it is better than your handrolled loop. But surprisingly not hugely better, hmmm.</li>
<li>You don't have to use Arrays.fill, you can roll your own and it works the same. Notice the call out to the fill method. But...</li>
<li>Don't get too creative rolling your own. If you get too funky (like filling the array backwards) it'll fall apart and the 'intrinsic' won't happen (a sketch of such a loop follows this list). But do note that the reverse fill still has some of that good SIMD stuff going, we'll get to that in a sec.</li>
</ul>
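<div>
By a 'too funky' loop I mean something like the following backwards fill (a sketch; the actual manualReversedFillBytes benchmark is in the gist above):</div>
<pre style="font-family: Courier New, Courier, monospace;">
class ReversedFill {
    // iterating backwards defeats the fill pattern matching, so no call to the
    // fill stub is made - though, as the results above suggest, the loop still
    // gets some SIMD love from SuperWord, which we get to below
    static void manualReversedFill(byte[] array, byte value) {
        for (int i = array.length - 1; i >= 0; i--) {
            array[i] = value;
        }
    }
}
</pre>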
<h3>
Are The Other Types Filling The Love?</h3>
It all sounds great don't it? Let's see how this pans out for other types. We'll be filling an array of 32KB. To be uniform across data types that means a 16K chars/shorts array, an 8K ints/floats array and a 4K array of longs. I added an 8K array of objects, which is the same size for compressed oops on the Oracle JVM (reference size is 4 bytes, same as an int).<br />
The JMH benchmark code is as you'd expect:<br />
<script src="https://gist.github.com/nitsanw/156fc152478081ea98b1.js"></script>
Here's some reasonable expectations:
<br />
<ul>
<li>If no optimizations are present, wider writes are more efficient. It follows that the longFill would be the fastest. But...</li>
<li>Given a clever compiler the fill loop is replaced with the widest writes possible, so there should be no significant difference. But the fill optimization does not cover double/long/object arrays, so we might expect longFill to be the worst performer.</li>
<li>An objects array is not that different from an int array, so performance should be similar. Sure there's a write barrier, but it need only be done once per card (not once for the whole array as I thought initially, god bless Shipilev and PrintAssembly), so that's an extra byte write per card of elements filled. A card is per 512 bytes, each element is 4 bytes, so that's one card per 128 elements. Given there is no fill method implemented for it we may expect it to be slightly worse than the longFill.</li>
<li>We should not rely on expectations, because performance is best measured.</li>
</ul>
As you'd expect the results are somewhat different than the expectations (Oracle JDK8u40/i7-4770@3.40GHz/Ubuntu):<br />
<div class="p1">
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.fillBytes 561.540 ± 10.814 ns/op</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.fillChars 541.901 ± 4.833 ns/op</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.fillShorts 532.936 ± 4.508 ns/op</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.fillInts 543.165 ± 3.788 ns/op</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.fillFloats 537.599 ± 2.323 ns/op</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.fillLongs 326.770 ± 3.267 ns/op</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.fillDoubles 346.840 ± 5.786 ns/op</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.fillObjects 4388.242 ± 11.945 ns/op</span></div>
<br />
Say WOT?<br />
For bytes/chars/shorts/ints/floats Arrays.fill performs very similarly. This much is as expected from the second point above. But filling an array of longs/doubles is better than the others. The funny thing is, there's no fill function implemented for the long array, how come it is so darn quick? Also, why does the objects fill suck quite so badly when compared with the rest (I will not be addressing this last question! I refuse! this post is too fucking long as it is!)?<br />
This is what happens when we turn off the <span style="font-family: Courier New, Courier, monospace;">OptimizeFill</span> flag:<br />
<div class="p1">
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.fillBytes 1013.803 ± 0.227 ns/op</span><br />
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.fillChars 323.806 ± 3.879 ns/op</span><br />
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.fillShorts 323.689 ± 4.499 ns/op</span><br />
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.fillInts 326.336 ± 1.559 ns/op</span><br />
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.fillFloats 319.449 ± 2.048 ns/op</span><br />
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.fillLongs 328.692 ± 3.282 ns/op</span><br />
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.fillDoubles 345.035 ± 6.362 ns/op</span><br />
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.fillObjects 4397.130 ± 7.161 ns/op</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div class="p1">
<span style="font-family: inherit;">Strange innit? now we got char/int/long arrays all performing similarly. In fact, with the exception of the byte array, everything is better than it was with the optimization.</span><br />
<span style="font-family: inherit;"><br /></span>
<br />
<h3>
<span style="font-family: inherit;"><a href="https://www.youtube.com/watch?v=t-wUe5aEwHM" target="_blank">Superword to the rescue!</a> </span></h3>
Turns out the JIT compiler is clued up on the topic of SIMD parallelisation by way of Superword Level Parallelism (see the <a href="http://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf" target="_blank"><span id="goog_352674027"></span>original paper here<span id="goog_352674028"></span></a>):<br />
<blockquote class="tr_bq">
<span style="font-family: Arial, Helvetica, sans-serif;">In some respects, superword level parallelism is a restricted form of ILP (Instruction Level Parallelism). ILP techniques have been very successful in the general purpose computing arena, partly because of their ability to find parallelism within basic blocks. In the same way that loop unrolling translates loop level parallelism into ILP, vector parallelism can be transformed into SLP. This realization allows for the parallelization of vectorizable loops using the same basic block analysis. As a result, our algorithm does not require any of the complicated loop transformations typically associated with vectorization. In fact, vector parallelism alone can be uncovered using a simplified version of the SLP compiler algorithm.<br />...<br />Superword level parallelism is defined as short SIMD parallelism in which the source and result operands of a SIMD operation are packed in a storage location.<br />...<br />Vector parallelism is a subset of superword level parallelism.</span></blockquote>
The Hotspot compiler implements SLP optimizations in <a href="http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/55fb97c4c58d/src/share/vm/opto/superword.cpp" target="_blank">superword.cpp</a> and you are invited to dive into the implementation if you like. I'm going to focus on its impact here, and to do that I only need to know how to turn it on and off (core competency for any software person). It's on by default, so the above results are what happens when it is on; here's what life looks like when it is off too (so -XX:-OptimizeFill -XX:-UseSuperWord):<br />
<span style="font-family: Courier New, Courier, monospace;">
ArrayFill.fillBytes 8501.270 ± 2.896 ns/op<br />
ArrayFill.fillChars 4286.161 ± 4.935 ns/op<br />
ArrayFill.fillShorts 4286.168 ± 3.146 ns/op<br />
ArrayFill.fillInts 2152.489 ± 2.653 ns/op<br />
ArrayFill.fillFloats 2140.649 ± 2.587 ns/op<br />
ArrayFill.fillLongs 1105.926 ± 2.228 ns/op<br />
ArrayFill.fillDoubles 1105.820 ± 2.393 ns/op<br />
ArrayFill.fillObjects 4392.506 ± 11.678 ns/op</span><br />
<br />
Life is revealed in all its sucky splendour! This is what happens when the compiler shows you no love... did I say no love? Hang on, things can get a bit worse.<br />
<h3>
Detour: Unsafe? We don't serve you kind here</h3>
To all the Unsafe fans, I got some sad news for y'all. Unsafe 'fill' loops are not well loved by the compiler. This is the price of stepping off the beaten path I guess. Consider the following benchmark:
<script src="https://gist.github.com/nitsanw/c3a8c7b99d5deb7e1d45.js"></script>
<br />
The results are:<br />
<span style="font-family: Courier New, Courier, monospace;">
ArrayFill.unsafeFillOffheapBytes 9742.621 ± 2.270 ns/op<br />
ArrayFill.unsafeFillOnHeapBytes 12640.019 ± 1.977 ns/op<br />
ArrayFill.fillBytes(for reference) 561.540 ± 10.814 ns/op</span>
<br />
The Unsafe variants do not enjoy the 'fill' pattern matching magic, nor do they get the SuperWord optimizations. What can you do? For this kind of thing you should use the Unsafe.setMemory method instead:<br />
<script src="https://gist.github.com/nitsanw/d288061d1070bde7226b.js"></script>
With the result:<br />
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.unsafeSetOffheapBytes 1259.281 ± 21.294 ns/op</span><br />
<span style="font-family: Courier New, Courier, monospace;">ArrayFill.unsafeSetOnHeapBytes 1275.158 ± 27.950 ns/op</span><br />
Not quite there, still ~2x worse (why? how come it doesn't just call the bytes fill method? a bit of digging shows it ends up calling the underlying platform's memset...) but beats being 20-25x worse like the handrolled method is.
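<div>
For reference, the on-heap setMemory variant amounts to something like this (a sketch; the actual benchmark is in the gist above, and grabbing the Unsafe instance via reflection is the usual hack outside the JDK):</div>
<pre style="font-family: Courier New, Courier, monospace;">
import java.lang.reflect.Field;
import sun.misc.Unsafe;

class SetMemoryExample {
    static final Unsafe UNSAFE = getUnsafe();

    // fill an on-heap byte array via Unsafe.setMemory rather than a manual loop
    static void fill(byte[] array, byte value) {
        UNSAFE.setMemory(array, Unsafe.ARRAY_BYTE_BASE_OFFSET, array.length, value);
    }

    // the usual reflective hack to get hold of Unsafe outside the JDK
    private static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new AssertionError(e);
        }
    }
}
</pre>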
<br />
<br />
<h3>
Summary and Musings</h3>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsd37Lgw2dWok484Z4u7DGrr6IxN_AwzaeagK9uMWCjqWKzjrK9CslTA_HZF2OnfmTTtYShenq8YortJuLOyUZvcXAE9aP1ezkRXmbr52Qv2bq9crLex68Xgfgj_66SowjsBkCyDMXzCg/s1600/TheRealCircleOfLife-52687.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsd37Lgw2dWok484Z4u7DGrr6IxN_AwzaeagK9uMWCjqWKzjrK9CslTA_HZF2OnfmTTtYShenq8YortJuLOyUZvcXAE9aP1ezkRXmbr52Qv2bq9crLex68Xgfgj_66SowjsBkCyDMXzCg/s1600/TheRealCircleOfLife-52687.jpg" height="320" width="285" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">It's the circle of life!</td></tr>
</tbody></table>
<div>
So what did we learn:</div>
<div>
<ul>
<li>There's another kind of 'intrinsic' like optimization, which uses pattern matching to swap a block of code rather than a method. This is employed for the intrinsification of memset-like memory fill loops (in particular Arrays.fill). It's not an intrinsic technically, but you know what I fucking mean. </li>
<li>System.arrayCopy/Arrays.fill implementations utilize SIMD instructions to improve their efficiency. These instructions are not available in plain Java, so some compiler intervention is required.</li>
<li>The JIT compiler is also able to use SuperWord Level Parallelism to derive SIMD code from 'normal' sequential code.</li>
<li>In the case of Arrays.fill, it looks like the SuperWord optimized code is faster than the fill specialized implementation for all types except bytes (on the system under test)</li>
<li>If you use Unsafe you will be excluded from these optimizations.</li>
</ul>
So I look at this process and I imagine history went something like this:</div>
<div>
We want to use SIMD instructions, but the JIT compiler isn't really clever enough to generate them by itself. Memset implementations are rather specialized after all. Let's make life a bit easier for the compiler by creating an intrinsic. We'll even go the extra mile and make an effort to automatically identify opportunities to use this intrinsic, so now it's not really an intrinsic any more. The Arrays.fill optimization is available on Oracle JDK6u45 (the oldest I keep around, maybe it was there a while before that) and on that JVM it is twice as fast as the SLP generated code.</div>
<div>
Over time, SLP gets better and eventually the compiler is now good enough to optimize the fill loop by itself and beat the specialized method. That is an awesome thing. We just need to remove the training wheels now.</div>
<div>
And there's a final punch line to this story. Memset/Memcpy are such common and important opportunities for optimization, so Intel has decided to offer an assembly 'recipe' for them and save everyone the effort in writing them:</div>
<div>
<blockquote class="tr_bq">
<b>3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB)</b><br />
Beginning with processors based on Intel microarchitecture code name Ivy Bridge, REP string operation using MOVSB and STOSB can provide both flexible and high-performance REP string operations for software in common situations like memory copy and set operations. Processors that provide enhanced MOVSB/STOSB operations are enumerated by the CPUID feature flag: CPUID:(EAX=7H, ECX=0H):EBX.ERMSB[bit 9] = 1. - [From the Intel Optimization Manual (September 2014)]</blockquote>
From the manual it seems that this method of implementing memcpy/memset can perform well, but like anything else, YMMV (the intel manual discussion of the performance differences is in itself interesting both on the results and the methodology level). One obvious advantage of this method is that it results in much much smaller code that should be trivial to inline into callers. This will however put the SuperWord method at a slight disadvantage, and the tide will change again.<br />
[UPDATE 14/03/2015: It seems the good folks of Oracle have <a href="http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8005522" target="_blank">considered and rejected the use of REP MOVSB for array copy.</a>]<br />
Thanks go to the kind reviewers <a href="https://twitter.com/peterhughesdev" target="_blank">Peter 'Massive' Hughes</a>, <a href="https://twitter.com/darachennis" target="_blank">Darrach</a> and the <a href="https://twitter.com/shipilev" target="_blank">Shipster</a></div>
</div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com12tag:blogger.com,1999:blog-5171098727364395242.post-50019176851648611832015-03-11T10:45:00.001+00:002015-03-11T10:45:48.267+00:00Correcting YCSB's Coordinated Omission problemYCSB is the <a href="http://labs.yahoo.com/news/yahoo-cloud-serving-benchmark/" target="_blank">Yahoo Cloud Serving Benchmark</a>(also on <a href="http://en.wikipedia.org/wiki/YCSB" target="_blank">wiki</a>): a generic set of benchmarks setting out <br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyYm4y_WmTX2vz0WjMBVQ7djSdPYyBERUBqCHFt06L5ZQnTFzQiZF8qUEIlvRG9VzOUs7iSOQ3V4EvAcc137cf6luM7fvc19kmMUG70Hg8YmyTCto7d06r1ka0ERTZIyX8dITZVv9IEnE/s1600/NIMBUS-CLOUD-SERVING-BOARD_zpsd68a9cf2.jpeg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyYm4y_WmTX2vz0WjMBVQ7djSdPYyBERUBqCHFt06L5ZQnTFzQiZF8qUEIlvRG9VzOUs7iSOQ3V4EvAcc137cf6luM7fvc19kmMUG70Hg8YmyTCto7d06r1ka0ERTZIyX8dITZVv9IEnE/s1600/NIMBUS-CLOUD-SERVING-BOARD_zpsd68a9cf2.jpeg" height="213" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The Nimbus Cloud Serving Board</td></tr>
</tbody></table>
to compare different key-value store providers under a set of loads:<br />
<blockquote class="tr_bq">
<span style="background-color: white; color: #181818; line-height: 16px;"><span style="font-family: inherit;">The goal of the Yahoo Cloud Serving Benchmark (YCSB) project is to develop a framework and common set of workloads for evaluating the performance of different "key-value" and "cloud" serving stores.</span></span></blockquote>
The code is open for extension and contribution and all that good stuff, you can get it <a href="https://github.com/brianfrankcooper/YCSB" target="_blank">here</a>. And it has become a tool for comparing vendors in the NoSQL space. The benchmarks set out to measure latency and throughput. The terms are not directly defined in the <a href="https://www.cs.duke.edu/courses/fall13/compsci590.4/838-CloudPapers/ycsb.pdf" target="_blank">paper</a>, but the following statement is made:<br />
<blockquote class="tr_bq">
The Performance tier of the benchmark focuses on the
latency of requests when the database is under load. Latency
is very important in serving systems, since there is
usually an impatient human waiting for a web page to load. [...] Typically application designers must decide on an
acceptable latency, and provision enough servers to achieve
the desired throughput while preserving acceptable latency. [...] The YCSB
Client allows the user to define the offered throughput as
a command line parameter, and reports the resulting latency,
making it straightforward to produce latency versus
throughput curves.</blockquote>
What could possibly go wrong?™<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUpizleOFJVDCZryfxbHmZZbGpa_2NhXV9rnmNz5v9C4LsXVLrQ1L9Oe6dXRR9tm7t4nUl2FNbb_KSFPoptiiGlyFvInW2S5iXMYNj_tgjV1lXzIXtgiHQakymnb-gABxETHlEIKsyQ6g/s1600/Histogram+(2).png" imageanchor="1" style="font-size: medium; font-weight: normal; margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUpizleOFJVDCZryfxbHmZZbGpa_2NhXV9rnmNz5v9C4LsXVLrQ1L9Oe6dXRR9tm7t4nUl2FNbb_KSFPoptiiGlyFvInW2S5iXMYNj_tgjV1lXzIXtgiHQakymnb-gABxETHlEIKsyQ6g/s1600/Histogram+(2).png" height="244" width="640" /></a>
<div>
It can go this wrong, for instance: an order of magnitude difference in results for different percentiles, leading to some poor decision making on how much hardware you'll need, leading to getting fired from your job and growing old, bitter and twisted, mumbling to yourself as you get drunk on the street corner until you freeze to death on a winter night. So potentially this is a risk to your future well being; listen up!</div>
<h3><br/>
It's broken? Coordinated WOT?</h3>
<div>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7H1S-VVsqb6XrdSRtRBs1FKrNfx3V_tZGEQQR-TZOmuYVHaPHISG_c6atFsAIgKbgRzW2Zd1Q9VbNSmCkQKfYTWolR2JsKaQOmUds-8XG1AwGtxEVq4a3hOddjm7crs3YFw1z6p1MwZA/s1600/gil-UnderstandingLatency.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7H1S-VVsqb6XrdSRtRBs1FKrNfx3V_tZGEQQR-TZOmuYVHaPHISG_c6atFsAIgKbgRzW2Zd1Q9VbNSmCkQKfYTWolR2JsKaQOmUds-8XG1AwGtxEVq4a3hOddjm7crs3YFw1z6p1MwZA/s1600/gil-UnderstandingLatency.jpg" height="146" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: xx-small;">When you measure latency bad,</span><br />
<span style="font-size: xx-small;">Mr. Tene is sad</span><br />
<span style="font-size: xx-small;"> :(</span></td></tr>
</tbody></table>
My colleague at Azul, <a href="https://github.com/giltene" target="_blank">Gil Tene</a>, the magnificent, glorious, multi-dimensional, coding CTO, officer and gentleman (that's my bonus sorted) has been doing a lot of latency related preaching and teaching in the last few years. He has given the following talks at any number of conferences, but if you happened to have missed them, watch them NOW:</div>
<div>
<ul>
<li>"<a href="http://www.infoq.com/presentations/latency-pitfalls" target="_blank">How NOT to Measure Latency</a>" - At QCon London, 2013</li>
<li>"<a href="https://www.youtube.com/watch?v=9MKY4KypBzg" target="_blank">Understanding Latency</a>" - At React Conf SF, 2014</li>
</ul>
</div>
<div>
In particular he has coined the term "Coordinated Omission" (<a href="https://groups.google.com/forum/#!msg/mechanical-sympathy/icNZJejUHfE/BfDekfBEs_sJ" target="_blank">see raging discussion on Mechanical Sympathy</a>) to describe instances in which the measuring/monitoring system coordinates measurement with the system under test/measurement such that samples are biased. This issue manifests in many load generating frameworks where the call into the system under test is done synchronously and the measurement thread holds up the next sample while the call is ongoing. This enables the system under test to delay requests that would have been made during the synchronous call thus skewing the sample set. Consider for example a system where:</div>
<div>
<ul>
<li>We set out to measure a request every 1ms (from a single thread, synchronously)</li>
<li>The first 500 calls come back in 100µs each (so call K starts at Kms and returns at Kms + 100µs )</li>
<li>Call 501 takes <b>500 milliseconds </b>(starts at 501ms, returns at 1001ms)</li>
<li>Call 502 takes 100µs</li>
</ul>
</div>
<div>
See the problem?</div>
<div>
The problem is that call 502 did NOT happen at its designated time, and saying it took 100µs fails to capture this. It violates the assumptions laid out in the first sentence because we were blocked for 500ms. If we were to stick to our original <b>schedule </b>we would be making calls 502 to 1000 in the time it took for call 501 to execute. How should we treat this departure from plan?</div>
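<div>
To make the difference concrete, here's a minimal, hypothetical Java sketch (not YCSB code, and with the ~100µs service time rounded down to 0ms for simplicity) that replays the scenario above and reports both the naive per-call service time and the latency measured from each call's intended start time:</div>
<pre>
public class CoordinatedOmissionDemo {
  public static void main(String[] args) {
    final long intervalMs = 1;               // we intend to issue one call every 1ms
    long actualFreeAtMs = 0;                 // when the single synchronous caller is free again
    for (int k = 1; k <= 502; k++) {
      long intendedStartMs = k * intervalMs; // call K is scheduled at K ms
      long startMs = Math.max(intendedStartMs, actualFreeAtMs);
      long serviceMs = (k == 501) ? 500 : 0; // call 501 stalls for 500ms, the rest are ~instant
      long endMs = startMs + serviceMs;
      long uncorrected = endMs - startMs;        // what a naive harness reports
      long corrected = endMs - intendedStartMs;  // latency measured from the intended start
      if (k >= 500) {
        System.out.printf("call %d: uncorrected=%dms corrected=%dms%n", k, uncorrected, corrected);
      }
      actualFreeAtMs = endMs;                // the next call can only start once this one returns
    }
  }
}
</pre>
<div>
Call 502 comes back as 0ms in the uncorrected view, but close to 500ms when measured against its intended start time.</div>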
<div>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdjDzJR6uL7IF-U8mtiC-UI9f-YoDUCRzLdnefIlwjt_wRnPS7mK92YpeRIlEJjIkAhY0fdQPRebJ7ORw6bz3NljdcSa9R35Ft93Ro_QLXOK9NTGnyN2cRrmrDuOZValxXqiI3VCc2vGk/s1600/heads+in+sand.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdjDzJR6uL7IF-U8mtiC-UI9f-YoDUCRzLdnefIlwjt_wRnPS7mK92YpeRIlEJjIkAhY0fdQPRebJ7ORw6bz3NljdcSa9R35Ft93Ro_QLXOK9NTGnyN2cRrmrDuOZValxXqiI3VCc2vGk/s1600/heads+in+sand.jpg" height="92" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Is it safe to come out yet?</td></tr>
</tbody></table>
<ol>
<li><b>Ignore it!</b> it will go away by itself! - This is the coordinated omission way. We are now reporting numbers that are no longer according to the test plan, which means that our "latency versus throughput curves" are off the mark. This is a very common solution to this issue.</li>
<li><b>Fail the test</b>, we wanted a call every ms and we didn't get that - This is an honest hardline answer, but it potentially throws the baby out with the bath water. I think that if you set out to schedule 1000 calls per second you might want to see how often this falls apart and how. But this answer is The Truth™, can you handle it? If one is to start from scratch and write their own load generator I propose a read of the <a href="http://twitter.github.io/iago/philosophy.html" target="_blank">Iago load test framework philosophy</a> page: "<span style="background-color: white; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 13px; line-height: 18px;">Iago accurately replicates production traffic. It models open systems, systems which receive requests independent of their ability to service them. Typical load generators measure the time it takes for </span><var style="background-color: white; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 13px; line-height: 18px;">M</var><span style="background-color: white; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 13px; line-height: 18px;"> threads to make </span><var style="background-color: white; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 13px; line-height: 18px;">N</var><span style="background-color: white; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 13px; line-height: 18px;"> requests, waiting for a response to each request before sending the next; if your system slows down under load, these load testers thus mercifully slow down their pace to match. That's a fine thing to measure; many systems behave this way. But maybe your service isn't such a system; maybe it's exposed on the internet. Maybe you want to know how your system behaves when </span><var style="background-color: white; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 13px; line-height: 18px;">N</var><span style="background-color: white; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 13px; line-height: 18px;"> requests per second come in with no "mercy" if it slows down.</span>". This is a fine sentiment.</li>
<li><b>Coordinated Omission correction:</b> Adjust the results to reflect the expected call rate. This can be done in a straightforward manner if the 'missing' calls are added back with a latency which reflects the period for which they were delayed. This correction method is supported out of the box by HdrHistogram (see the sketch just after this list) but the discussion regarding its over- or under-estimation of the impact of the delay is outside the scope of this post.</li>
<li><b>Coordinated Omission avoidance:</b> Measure all calls according to original <b>schedule</b>. We are now saying: "If I can't make the call, the meter is still running!". This is particularly relevant for systems where you would typically be making the requests to the system under test from a thread pool. That thread pool would be there to help you support asynchronous interaction where the API failed in giving you that option. Like JDBC... Like many key-value pair provider APIs.</li>
</ol>
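<div>
As an aside, option 3 above maps directly onto HdrHistogram's built-in correction. A minimal sketch of what that looks like (plain HdrHistogram usage, not part of the YCSB change described in this post):</div>
<pre>
import org.HdrHistogram.Histogram;

public class CorrectionExample {
  public static void main(String[] args) {
    // track up to 1 minute in microseconds, 3 significant digits
    Histogram histogram = new Histogram(60_000_000L, 3);
    long expectedIntervalUs = 1_000; // we intended to issue one request per millisecond
    // recording a 500ms stall with correction also back-fills the 'missing' samples
    // at 499ms, 498ms, ... down to the expected interval
    histogram.recordValueWithExpectedInterval(500_000, expectedIntervalUs);
    System.out.println("99%ile (us): " + histogram.getValueAtPercentile(99.0));
  }
}
</pre>
<div>
Whether this back-filling over- or under-estimates the real impact of the delay is the debate alluded to above.</div>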
The last solution above (option 4) is the one we'll go for in this post, but I would urge you to consider the results critically. In particular, if you are trying to simulate independent access to a web server (as opposed to a DB via a thread pool) then adherence to schedule might paint a hugely optimistic picture of the results in your case. This is because failing to generate independent load may have all sorts of beneficial effects on the system under test.<br />
For the YCSB benchmark I'm assuming the harness/load generator is simulating a web serving layer accessing the key-value store in an effort to serve an unbounded, uncoordinated user request load via a predefined thread pool. So it's door number 3 for me. <a href="https://github.com/LatencyUtils/YCSB2" target="_blank">The corrected load generator is here</a>.</div>
<h3><br/>
Step 0: Some preliminary work (not strictly required)</h3>
<div>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgem0BxGkkKUVIBTpyZS9CsV3V9s6Vg65ei_o2RIxIti55JpfxN_rpBZmoI6hoLZqxr-51eBhBBzohUiCEGzQpLMMf3r3wMhe1BDExHpPT4sV8hUUGjkS6vQF1TpGYKizLsNqJ28C-ud4k/s1600/fmercury-thumbsup.jpeg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgem0BxGkkKUVIBTpyZS9CsV3V9s6Vg65ei_o2RIxIti55JpfxN_rpBZmoI6hoLZqxr-51eBhBBzohUiCEGzQpLMMf3r3wMhe1BDExHpPT4sV8hUUGjkS6vQF1TpGYKizLsNqJ28C-ud4k/s1600/fmercury-thumbsup.jpeg" height="129" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">HdrHistogram, as approved by Freddie!</td></tr>
</tbody></table>
As described previously here, we should all <a href="http://psy-lob-saw.blogspot.com/2015/02/hdrhistogram-better-latency-capture.html" target="_blank">just get on with capturing latency using HdrHistograms.</a> So as a first step toward correcting YCSB I have gone in and added an HdrHistogram measurement container. This is pretty straight forward as all I needed to modify was the <a href="https://github.com/LatencyUtils/YCSB2/blob/master/core/src/main/java/com/yahoo/ycsb/measurements/Measurements.java" target="_blank">Measurements</a> class to allow a new measurement type. While I was there I tweaked this and that and the following list of changes to that class emerged:</div>
<div>
<ol>
<li>Add new measurement type and corresponding command-line option("<i>-p </i><i>measurementtype=</i><i>hdrhistogram</i>")</li>
<li>Add combined measurement option allowing old/new measurement side by side: "<i>hdrhistogram+histogram</i>"
</li>
<li><div class="p1">
Add support for capturing both corrected and uncorrected measurements for the same run.</div>
</li>
<li><div class="p1">
Use ConcurrentHashMap (CHM) instead of synchronizing around a HashMap.</div>
</li>
</ol>
The <a href="https://github.com/LatencyUtils/YCSB2/blob/master/core/src/main/java/com/yahoo/ycsb/measurements/OneMeasurementHdrHistogram.java" target="_blank">new measurement type</a> supports logging lossless HdrHistogram data to a file (controlled by the <i>hdrhistogram.fileoutput=<true|false> </i>option and the <i>hdrhistogram.output.path=<path></i> option) as well as better-precision percentile data and lock-free logging of latencies. This is not very interesting work, but if you are interested in the "How would I plug in my own data structure to capture latency into YCSB?" topic, have fun. It was not necessary for the correction, but it was good to do so that better quality results can be observed. You're welcome.</div>
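<div>
For the curious, the lock-free recording and lossless interval logging lean on standard HdrHistogram building blocks; here's a minimal sketch of that combination (my illustration of the general shape, not the actual YCSB patch):</div>
<pre>
import org.HdrHistogram.Histogram;
import org.HdrHistogram.HistogramLogWriter;
import org.HdrHistogram.Recorder;

import java.io.FileNotFoundException;

public class HdrRecordingSketch {
  public static void main(String[] args) throws FileNotFoundException {
    Recorder recorder = new Recorder(3);                  // lock-free, safe for many writer threads
    HistogramLogWriter logWriter = new HistogramLogWriter("READ.hdr");
    // measurement threads record latencies concurrently:
    recorder.recordValue(4_500);                          // latency in microseconds
    // a status/reporting thread periodically swaps out and logs the interval histogram:
    Histogram interval = recorder.getIntervalHistogram();
    logWriter.outputIntervalHistogram(interval);          // one compressed, lossless line per interval
  }
}
</pre>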
<h3><br/>
Step 1: Demonstrate the issue</h3>
<div>
YCSB includes a very useful means of verifying the measurements in the form of a mock DB driver. This means we can test our assertions regarding coordinated omission without setting up a key value store of any kind. The mock DB is called BasicDB and is the default DB used. We can configure it to simulate a pause and see what happens (<i>-p basicdb.verbose=false -p basicdb.simulatedelay=4 </i> will make the mock DB stop logging every action and simulate a latency of 0-4ms for each action). I added a further option to the BasicDB which allows us to turn off the randomization of the delay (<i>-p basicdb.randomizedelay=false</i>).</div>
<div>
Let's consider our expectations in the case where a DB simply cannot handle request quickly enough. We can setup an experiment with the following settings: <i>-target 1000 -threads 1 -s </i><i>-p status.interval=1 </i><i>-p workload=com.yahoo.ycsb.workloads.CoreWorkload -p basicdb.verbose=false -p basicdb.simulatedelay=4 -p basicdb.randomizedelay=false -p measurementtype=hdrhistogram -p maxexecutiontime=60</i></div>
<div>
Here's what they all mean:</div>
<div>
<ul>
<li><i>-target 1000 -> </i>We aim to test 1000 requests per second</li>
<li><i>-threads 1 -> </i>We have a single client thread</li>
<li><i>-s </i><i>-p status.interval=1 -> </i>We will be printing out status every second (I made the status interval configurable)</li>
<li><i>-p basicdb.verbose=false -p basicdb.simulatedelay=4 -p basicdb.randomizedelay=false -> </i>The DB will sleep 4ms on each request, so the maximum we can hope for is 250 requests per second; no noisy printing per operation please</li>
<li><i>-p measurementtype=hdrhistogram -> </i>Use HdrHistogram to capture the latencies</li>
<li><i>-p maxexecutiontime=60 -></i> Run for one minute, then exit and print summary</li>
</ul>
This DB is obviously failing: it can't keep up with the rate of incoming requests and, according to our model, they queue up. The time measured per call is reflected in the summary for the READ operations:</div>
<div>
<blockquote class="tr_bq">
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], Operations, 12528.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], AverageLatency(us), 4477.102809706258</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], MinLatency(us), 4018.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], MaxLatency(us), 44703.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], 95thPercentileLatency(ms), 4.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], 99thPercentileLatency(ms), 4.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], Return=0, 12528</span></div>
</blockquote>
But this completely ignores the time spent on the queue. If we were measuring according to schedule we'd get the following set of latencies:</div>
<blockquote class="tr_bq">
Latency[k] = 4 + 3*(k-1) ms</blockquote>
The max latency would be for the last request to get in. We ran for 60 seconds, at 250 requests/sec which means our last request was (k=15000) and had a latency of <b>45 seconds</b> when measured from the time we <u>intended to make it</u>. This number reflects the system's failure to handle load far more correctly than the numbers quoted above.<br />
<h3><br/>
Step 2: Working to Schedule</h3>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvQLsoodL6soLI8cbu4fwMY_A-01mty8RpKk1ZjjLViTKkT0fGyJoYWTy4qEpuS86gKtCTOAl_AswQgiC2YU5FjgI1fOF38BMZyP9JkjeQrUBsjNTA-TgJ3lV93CkbLXfMcpp4NvIPUTA/s1600/440px-Wittner_metronome.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvQLsoodL6soLI8cbu4fwMY_A-01mty8RpKk1ZjjLViTKkT0fGyJoYWTy4qEpuS86gKtCTOAl_AswQgiC2YU5FjgI1fOF38BMZyP9JkjeQrUBsjNTA-TgJ3lV93CkbLXfMcpp4NvIPUTA/s1600/440px-Wittner_metronome.jpg" height="150" width="200" /></a></div>
The YCSB load generator has a weak notion of schedule, in the sense that it opts for option number 1 above and will just execute the operations when it can. When faced with the task of correcting this kind of issue (in a pile of foreign code) we need to look for 2 things in the load generator:<br />
<ol>
<li>"Scheduling an action to run at time X" - This will involve some calls to one of the many scheduling facilities in the JDK:</li>
<ol>
<li>Thread.sleep is an old favourite, but <a href="http://docs.oracle.com/javase/8/docs/api/java/util/concurrent/TimeUnit.html" target="_blank">TimeUnit</a> also supports a sleep method. A search for sleep in the code base will cover both. This is <a href="https://github.com/brianfrankcooper/YCSB/blob/5659fc582c8280e1431ebcfa0891979f806c70ed/core/src/main/java/com/yahoo/ycsb/Client.java#L259" target="_blank">what YCSB was using</a> to schedule next event to fire.</li>
<li>Code submitting tasks to <a href="http://docs.oracle.com/javase/8/docs/api/java/util/Timer.html" target="_blank">java.util.Timer</a>, or alternatively the <a href="http://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html" target="_blank">ScheduledExecutorService</a></li>
<li>Code using <a href="http://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/LockSupport.html" target="_blank">LockSupport</a>.parkNanos</li>
<li>Object.wait(...)</li>
<li>others?</li>
</ol>
<li>"Measuring the length of an operation" - This will involve calls to System.nanoTime() or currentTimeMillis(). For YCSB this is found <a href="https://github.com/brianfrankcooper/YCSB/blob/5659fc582c8280e1431ebcfa0891979f806c70ed/core/src/main/java/com/yahoo/ycsb/DBWrapper.java#L89" target="_blank">to happen for example here</a>.</li>
</ol>
<br />
To correct this problem I had to introduce the concept of 'intended start time' to the operations measurement. Schedule for YCSB is specified by the <i>-target </i>command line option which sets the overall number of operations per second to be attempted by the load generator. This is optional, and the default is to go as fast as you can manage, i.e. with no schedule but the back pressure from the system under test to guide us. I'm not sure what assumed rate of requests would be reasonable in this case, so I did not correct it.<b> NOTE: If you don't specify <i>target</i> no correction will take place.</b></div>
<div>
The <i>target</i> parameter is translated to a per-thread operation rate (number of threads is set via the <i>threads</i> option, default is 1) so if we have 10 threads, and the target request rate is 1000 (<i>-target 1000 -threads 10</i>) we will have each thread hitting the store with 100 requests per second. The client threads randomize the first operation time to avoid all hitting the store on the same interval. I did some ground work here by setting the units across the board to nanoseconds and naming interval parameters appropriately, nothing too exciting.<br />
The actual correction at its core involves:<br />
<ol>
<li>Record the operation's intended start time</li>
<li>Use the intended start time when computing latency</li>
</ol>
Sadly the way YCSB measures latency does not lend itself to a simple in-place fix. The operations are scheduled by the ClientThread which calls into a workload, calling into a DB, which is actually the DBWrapper which measures the latency (for calling into an actual DB implementation) and reports it to the central Measurements singleton. This means that changing the Workload/DB API to include a startTime parameter to each call is quite a far-reaching change which would require me to dig through all the DB driver implementations and would result in a very unpleasant time for all.<br />
I settled on using a thread local on the Measurements object to transfer the start time to the DBWrapper; it is not a nice way to do things (and I'm happy to hear better suggestions) but it does the job without modifying the API.<br />
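<div>
A stripped down sketch of that handoff (illustrative names, not the exact YCSB classes; it works because the client thread and the DBWrapper call run on the same thread):</div>
<pre>
// Sketch only: the real Measurements/DBWrapper classes carry a lot more state.
public final class Measurements {
  private static final Measurements INSTANCE = new Measurements();
  // the client thread parks the operation's intended start time here...
  private final ThreadLocal<Long> intendedStartTimeNs = new ThreadLocal<Long>();

  public static Measurements getMeasurements() { return INSTANCE; }

  public void setIntendedStartTimeNs(long timeNs) { intendedStartTimeNs.set(timeNs); }

  // ...and the DB wrapper reads it back on the same thread when the call completes
  public long getIntendedStartTimeNs() { return intendedStartTimeNs.get(); }
}

// in the client thread, before invoking the DB:
//   Measurements.getMeasurements().setIntendedStartTimeNs(deadlineNs);
// in the DB wrapper, after the call returns:
//   long intendedLatencyUs =
//       (System.nanoTime() - Measurements.getMeasurements().getIntendedStartTimeNs()) / 1000;
</pre>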
Once we have:<br />
<ol>
<li>ClientThread <a href="https://github.com/LatencyUtils/YCSB2/blob/master/core/src/main/java/com/yahoo/ycsb/Client.java#L305" target="_blank">setting up the start time for the operation via Measurements</a></li>
<li>DBWrapper <a href="https://github.com/LatencyUtils/YCSB2/blob/master/core/src/main/java/com/yahoo/ycsb/DBWrapper.java#L156" target="_blank">using the start time from Measurements</a> to compute the operation latency</li>
</ol>
That's pretty much it. For extra points I wanted to include some facilities to compare measurements before/after the change. These can be removed if we accept HdrHistogram as a replacement and if we accept we only want to measure the intended latency, which would result in a much smaller PR.<br />
<h3><br/>
Step 3: is the issue solved?</h3>
</div>
<div>
Running the setup from step 1 such that it produces the intended latency as well as the original measurement side by side (<i>-p measurement.interval=both</i>) yields the following result for the READ operations:<br />
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], Operations, 12414.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], AverageLatency(us), 4524.981069759949</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], MinLatency(us), 4018.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], MaxLatency(us), 24703.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], 95thPercentileLatency(ms), 4.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], 99thPercentileLatency(ms), 4.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], Return=0, 12414</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], Operations, 12414.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], AverageLatency(us), 2.359010991606251E7</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], MinLatency(us), 4256.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], MaxLatency(us), 4.6989311E7</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], 95thPercentileLatency(ms), 42369.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], 99thPercentileLatency(ms), 46530.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
</div>
<div>
This reflects the effect a backed-up system would have on latency as we expressed in Step 1 above. It's actually a bit worse because the average cost of calling the mock DB with a sleep of 4ms is 4.5ms. As we can see the maximum latency is 46.9 seconds, reflecting the fact that the last read to execute was scheduled to hit the system 13.1 seconds into the run.<br />
<h3><br/>Step 4: The limitations of the harness</h3>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitePGFdnmg920ng2yVVd-TMdiMUp5mQGOXUZuY6TEZ3MrHWH-tqBBaQwBQakb80b_IX8ClG57lVqG-Z-mTEfAPOSf8FSCEvurcHQNvWKgaiDHQqxjeMVPcTOTjSTPgXqj8LLruaEAEGrE/s1600/mesh+non-pull+harness.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitePGFdnmg920ng2yVVd-TMdiMUp5mQGOXUZuY6TEZ3MrHWH-tqBBaQwBQakb80b_IX8ClG57lVqG-Z-mTEfAPOSf8FSCEvurcHQNvWKgaiDHQqxjeMVPcTOTjSTPgXqj8LLruaEAEGrE/s1600/mesh+non-pull+harness.jpg" height="200" width="200" /></a></div>
We can now also consider the perfect DB for the sake of observing the shortcomings of the test harness by setting the mock DB delay to 0 (<i>-p basicdb.simulatedelay=0</i>):<br />
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], Operations, 56935.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], AverageLatency(us), 0.01796785808377975</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], MinLatency(us), 0.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], MaxLatency(us), 49.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], 95thPercentileLatency(ms), 0.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], 99thPercentileLatency(ms), 0.0</span></div>
<div class="p1">
</div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], Return=0, 56935</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], Operations, 56935.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], AverageLatency(us), 232.37026433652412</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], MinLatency(us), 0.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], MaxLatency(us), 39007.0</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], 95thPercentileLatency(ms), 0.0</span></div>
<div class="p1">
</div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], 99thPercentileLatency(ms), 0.0</span></div>
<div class="p1">
<br /></div>
<div class="p1">
How come it takes so long to measure a noop? Why such large differences? Here are some generic theories and how they panned out:</div>
<div class="p1">
</div>
<ul>
<li><b>The JVM running the load generator is running with suboptimal settings(<i>-Xms64m -Xmx64m -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime, </i>Oracle JDK8u31) on a busy Mac laptop running on battery</b></li>
</ul>
This is no way to benchmark anything, but the interesting thing is that if we have no schedule to stick to, the test harness is willing to just ignore the issue. If we run on a decent machine (with a decent OS) we get nicer results. <span style="font-family: inherit;">This is from a server class machine running CentOS6.3/OracleJDK8u25 with the same settings:</span><br />
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], Operations, 56930.0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], AverageLatency(us), 0.44417705954681186</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], MinLatency(us), 0.0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], MaxLatency(us), 20.0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], 95thPercentileLatency(ms), 0.0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], 99thPercentileLatency(ms), 0.0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], Return=0, 56930</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], Operations, 56930.0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], AverageLatency(us), 146.31262954505533</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], MinLatency(us), 15.0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], MaxLatency(us), 14255.0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], 95thPercentileLatency(ms), 0.0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], 99thPercentileLatency(ms), 0.0</span></div>
<div>
<span style="font-family: inherit;">This is still significant.</span><br />
<span style="font-family: inherit;"><br /></span></div>
<ul>
<li><b>The JVM suffers from warmup related artefacts</b></li>
</ul>
This certainly correlated to the max values I'm seeing here. When looking at the status line for the first second I see:</div>
<div>
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">[READ: Count=22, Max=14, Min=0, Avg=0.64, 90=0, 99=14, 99.9=14, 99.99=14]</span></div>
<div>
<span style="font-size: x-small;"><span style="font-family: 'Courier New', Courier, monospace;">[Intended-READ: Count=23, Max=</span><b style="font-family: 'Courier New', Courier, monospace;">14255</b><span style="font-family: 'Courier New', Courier, monospace;">, Min=15, Avg=5563.39, 90=13287, 99=14255, 99.9=14255, 99.99=14255]</span></span><br />
But after a few seconds the process settles and we see much better results, this is typical:<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ: Count=947, Max=14, Min=0, Avg=0.02, 90=0, 99=0, 99.9=2, 99.99=14]</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ: Count=946, Max=194, Min=61, Avg=151.29, 90=165, 99=175, 99.9=186, 99.99=194]</span><br />
<span style="font-family: inherit;">A good way to handle this issue is by relying on the HdrHistogram output files to grab relevant time periods for analysis. With the original YCSB output we have the percentile summary data, but this is not something we can combine for analysis. With the loss-less interval histogram logs we can look at any sub-period(which is longer than one interval, but shorter than the whole run) and get accurate full range histogram data. A common practice is to discard warmup period results, I'm no a fan of throwing away data, but since this is the load generator warmup I'd think it's quite legitimate. It's perhaps an interesting feature to add to such a framework that the framework can be warmed up separately from the system to examine cold system behaviour.</span><br />
<ul>
<li><b>Thread.sleep/LockSupport.parkNanos are not super accurate and may wakeup after the intended operation start time</b></li>
</ul>
I've added an option for spinning instead of sleeping (so burn a CPU). This has improved the average value dramatically from ~146µs to ~3.1µs. A typical status line now looks like:</div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ: Count=947, Max=13, Min=0, Avg=0.02, 90=0, 99=0, 99.9=3, 99.99=13]</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ: Count=948, Max=57, Min=0, Avg=0.47, 90=1, 99=12, 99.9=26, 99.99=57]</span></div>
<div>
It is obviously not desirable for the load generator to burn a CPU instead of sleeping, but sleeping introduces scheduling inaccuracies. This is an accuracy issue we didn't have to deal with when not measuring from a schedule. Spinning didn't change the magnitude of the measured outliers, but it dramatically reduced their number. The take away here is just that there are accuracy limitations to the load generator's ability to stick to schedule.</div>
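<div>
For reference, the two waiting strategies discussed here boil down to something like the following (a schematic sketch, not the actual YCSB code):</div>
<pre>
import java.util.concurrent.locks.LockSupport;

public final class Pacer {
  // Wait until the operation's intended start time.
  // Sleeping is cheap but may overshoot by the scheduler's granularity;
  // spinning is accurate to within a few nanoseconds but burns a whole core.
  static void awaitDeadline(long deadlineNs, boolean spin) {
    if (spin) {
      while (System.nanoTime() < deadlineNs) {
        // busy wait
      }
    } else {
      long sleepNs;
      while ((sleepNs = deadlineNs - System.nanoTime()) > 0) {
        LockSupport.parkNanos(sleepNs);
      }
    }
  }
}
</pre>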
<div>
<br />
<ul>
<li><b>GC pauses on the load generator side that are large enough to derail the schedule are now captured. If we don't track the intended start time we have no idea we have gone off schedule, unless the GC pause happens to land inside the measurement gap.</b></li>
</ul>
We should capture GC logs on the load generator side and make sure we correlate the GC events with recorded latencies. Here's a GC pause being captured by the corrected measurement:</div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ: Count=952, Max=0, Min=0, Avg=0, 90=0, 99=0, 99.9=0, 99.99=0] </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ: Count=952, Max=14, Min=0, Avg=0.03, 90=0, 99=0, 99.9=3, 99.99=14]</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[GC (Allocation Failure) [PSYoungGen: 17895K->1824K(18944K)] 17903K->1840K(62976K), <b>0.0024340</b> secs] [Times: user=0.01 sys=0.01, real=0.01 secs]</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">Total time for which application threads were stopped: <b>0.0026392</b> seconds</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ: Count=957, Max=0, Min=0, Avg=0, 90=0, 99=0, 99.9=0, 99.99=0]</span></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">[Intended-READ: Count=957, Max=<b>2719</b>, Min=0, Avg=5.21, 90=0, 99=0, 99.9=2119, 99.99=<b>2719</b>]</span></div>
<div>
<span style="font-family: inherit;">This process is running with a 64M heap, you can expect longer pauses as the heap grows (in particular as the young generation grows).</span><br />
<span style="font-family: inherit;"><br /></span></div>
<ul>
<li><b>The operation setup time is now being measured as well as the operation itself.</b></li>
</ul>
When running with the spin option we can see the average operation cost is ~3.1µs; this is all test harness overhead and is really quite negligible in the context of network-hopping operations. In other words, nothing to worry about for this harness, but it could well prove an issue for others.<br />
<h3><br/>Step 5: The Good, The Bad And The STW pausing DB</h3>
Many software processes have a latency profile that is far from normally distributed. To see what YCSB makes of this kind of profile now that we can compare corrected vs. uncorrected measurement I have built a mock DB that has 4 modes of latency (p is a uniform random number in [0,1]):<br />
<ol>
<li>Awesome (p < 0.9): we return in 200µs-1ms</li>
<li>Good (0.9 < p < 0.99): we return in 1-10ms</li>
<li>Minor Hiccup( 0.99 < p < 0.9999): we hit a bump, but only one thread is affected 10-50ms</li>
<li>Major Hiccup(0.9999 < p): we hit a STW pause(because GC/THP/LBJ/STD/others), all threads stop for 50-200ms</li>
</ol>
I implemented the above with a read-write lock, where the STW pause grabs the write lock and all the others grab the read lock. This is far from a perfect representation of a system (everyone waits for STW as intended, but also STW waits for everyone to start), but it will do. If you feel strongly that a better simulation is in order, write one and let's discuss in the comments!<br />
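<div>
Here is roughly what that looks like (a simplified sketch with made-up names, not the exact code in the repository):</div>
<pre>
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class HiccupingMockDb {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();

  public void read() throws InterruptedException {
    double p = ThreadLocalRandom.current().nextDouble();
    if (p < 0.9) {
      delayUnderReadLock(200, 1_000);        // awesome: 200us-1ms
    } else if (p < 0.99) {
      delayUnderReadLock(1_000, 10_000);     // good: 1-10ms
    } else if (p < 0.9999) {
      delayUnderReadLock(10_000, 50_000);    // minor hiccup: 10-50ms, only this thread affected
    } else {
      lock.writeLock().lock();               // major hiccup: stop the world, everyone waits
      try {
        sleepMicros(ThreadLocalRandom.current().nextLong(50_000, 200_000));
      } finally {
        lock.writeLock().unlock();
      }
    }
  }

  private void delayUnderReadLock(long fromUs, long toUs) throws InterruptedException {
    lock.readLock().lock();
    try {
      sleepMicros(ThreadLocalRandom.current().nextLong(fromUs, toUs));
    } finally {
      lock.readLock().unlock();
    }
  }

  private static void sleepMicros(long us) throws InterruptedException {
    Thread.sleep(us / 1_000, (int) ((us % 1_000) * 1_000));
  }
}
</pre>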
What sort of profile will we see? How far off course will our measurements be if we don't stick to schedule? Here's this setup run at a rate of 10,000 requests per second, with 25 threads (so each thread is trying for 400 reqs/sec, or 1 request every 2.5ms):<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], Operations, 569516.0</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], AverageLatency(us), 1652.1852871561116</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], MinLatency(us), 210.0</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], MaxLatency(us), 142463.0</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], 95thPercentileLatency(ms), 1.0</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], 99thPercentileLatency(ms), 19.0</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ], Return=0, 569516</span><br />
<br />
According to these numbers, the max is quite high but the overall impact of hiccups is not too severe (all depends on your requirements of course). Even at this stage we can see that the effect of global pauses is skewing the other measurements (if you hit a short operation while a STW pause is in progress you still have to wait for the STW event to finish).<br />
The corrected measurements tell a different story:<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], Operations, 569516.0</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], AverageLatency(us), 24571.6025835973</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], MinLatency(us), 268.0</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], MaxLatency(us), 459519.0</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], 95thPercentileLatency(ms), 83.0</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ], 99thPercentileLatency(ms), 210.0</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<br />
<div class="p1">
How can this be right? Can this be right?<br />
<ul>
<li>At a rate of 10,000 requests per second, the unlikely Major Hiccup is likely to happen every second. Consider this next time someone tells you of a 99.99%ile behaviour. Given an event rate of 10K per second, 99.99% is suddenly not very rare. Consider that at this rate there's likely to be a few events that are worse.</li>
<li>The average major hiccup is 125ms long; in this time 125/2.5 = 50 events are delayed on each of the 25 threads -> 50 * 25 = 1250 events are delayed from starting, and they will further delay each other as they execute. With roughly 1250 delayed events per hiccup, each having a 1 in 10,000 chance of being a major hiccup itself, it is quite probable that every few seconds one of these delayed events is another major hiccup. What with all the queuing up behind the first one etc, the pileup becomes quite reasonable.</li>
<li>The probability of a latency 'mode' is not the same as the probability of observing that latency per event, once STW and queuing effects are in play.</li>
</ul>
I've made the mock DB print out 'OUCH' every time we get slapped with a STW event. It turns out that we got very unlucky in this run and hit three of these in a row:<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">56 sec:</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ: Count=9192, Max=83903, Min=238, Avg=1745.13, 90=1531, 99=26959, 99.9=79551, 99.99=83775]</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ: Count=9208, Max=159999, Min=303, Avg=16496.92, 90=54271, 99=103807, 99.9=150527, 99.99=158335]</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">OUCH</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">OUCH</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">OUCH</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><span style="font-size: x-small;">57 sec: </span></span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[READ: Count=9642, Max=129727, Min=247, Avg=2318, 90=1799, 99=40607, 99.9=125631, 99.99=127359] </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Intended-READ: Count=9635, Max=459519, Min=320, Avg=102971.39, 90=200319, 99=374271, 99.9=442367, 99.99=457983]</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">This is quite telling.</span><br />
<span style="font-family: inherit;">The view on what's the worst second in this run is wildly different here. Because the uncorrected measurement takes each event as it comes it will take the view that 75 events were delayed by these hiccups, and none by more than 130ms. But from the corrected measurement point of view all the queued up measurements were effected and were further delayed by each other.</span><br />
<span style="font-family: inherit;">I've re-run, this time logging interval histograms in their compressed form for every second in the run. Logging a 60 seconds run with 1 second interval data cost me 200k (we can tweak the construction in </span>OneMeasurementHdrHistogram to minimize the cost<span style="font-family: inherit;">). I can take the compressed logs and use the </span><a href="https://github.com/HdrHistogram/HdrHistogram/blob/master/HistogramLogProcessor" target="_blank">HistogramLogProcessor</a> script provided with HdrHistogram to process the logs (you need to copy the HdrHistogram.jar into the script folder first). Running:<br />
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">./HistogramLogProcessor -i READ.hdr -o uncorrected -outputValueUnitRatio 1000</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">./HistogramLogProcessor -i Intended-READ.hdr -o corrected -outputValueUnitRatio 1000</span></div>
<div class="p1">
Will produce *.hgrm files for both. I then use the <a href="https://github.com/HdrHistogram/HdrHistogram/blob/master/GoogleChartsExample/plotFiles.html" target="_blank">plotFiles.html</a> to generate the following comparison:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUpizleOFJVDCZryfxbHmZZbGpa_2NhXV9rnmNz5v9C4LsXVLrQ1L9Oe6dXRR9tm7t4nUl2FNbb_KSFPoptiiGlyFvInW2S5iXMYNj_tgjV1lXzIXtgiHQakymnb-gABxETHlEIKsyQ6g/s1600/Histogram+(2).png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUpizleOFJVDCZryfxbHmZZbGpa_2NhXV9rnmNz5v9C4LsXVLrQ1L9Oe6dXRR9tm7t4nUl2FNbb_KSFPoptiiGlyFvInW2S5iXMYNj_tgjV1lXzIXtgiHQakymnb-gABxETHlEIKsyQ6g/s1600/Histogram+(2).png" height="244" width="640" /></a></div>
<div class="p1">
<br /></div>
<br />
They tell very different stories, don't they?<br />
The red line will have you thinking your system copes gracefully up to the 99%ile, slowly degrading to 20ms. When measuring correctly, however, the system is shown to degrade very quickly: the 20ms line is crossed as early as the median, and the 99%ile is 10 times the original measurement. The difference is even more pronounced when we look at one of those terrible seconds where we had back-to-back STW hiccups. I can use the <a href="https://github.com/HdrHistogram/HdrHistogram/blob/master/HistogramLogProcessor" target="_blank">HistogramLogProcessor</a> script to produce partial summary histograms for the 3 seconds around that spike:<br />
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">./HistogramLogProcessor -i Intended-READ.hdr -o correctedOuch3 -outputValueUnitRatio 1000 -start 1425637666.488 -end 1425637668.492</span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoPQmXJWO0Ye2xy5O_gWjEiadDQLtyvEZfxex4OCe0f8S0bwMkLt2No4E8KlZdXx6AGoYwjZ5fmZDHNRrBcck6iJ1bznZMtHbBtLLpz-yZXZYDz_hXm-E_QucGPAZwarScdkNDdTa-UzQ/s1600/Histogram+(3).png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoPQmXJWO0Ye2xy5O_gWjEiadDQLtyvEZfxex4OCe0f8S0bwMkLt2No4E8KlZdXx6AGoYwjZ5fmZDHNRrBcck6iJ1bznZMtHbBtLLpz-yZXZYDz_hXm-E_QucGPAZwarScdkNDdTa-UzQ/s1600/Histogram+(3).png" height="246" width="640" /></a></div>
<div class="p1">
<span style="font-family: inherit;">Similarly we can compare a good second with no STW pauses:</span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKUs3rxSuFjcXD-GLsYSyc98K338vvTmZ9Bl5fiX_nwlzeme4kJE8zmHivENkqiPDnO4L9hZfPO6JPfw-_DLbx0CJR2wNd8ACI0YDYbelu-ljJ6qgFa8vAamsi2GESApsDePri9c0y72c/s1600/Histogram+(4).png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKUs3rxSuFjcXD-GLsYSyc98K338vvTmZ9Bl5fiX_nwlzeme4kJE8zmHivENkqiPDnO4L9hZfPO6JPfw-_DLbx0CJR2wNd8ACI0YDYbelu-ljJ6qgFa8vAamsi2GESApsDePri9c0y72c/s1600/Histogram+(4).png" height="246" width="640" /></a></div>
<div class="p1">
<br /></div>
<div class="p1">
<h3><br/>Summary</h3>
</div>
<div class="p1">
Coordinated Omission is a common problem in load generators (and other latency reporters), we had a look at fixing YCSB, an industry standard load generator:</div>
<div class="p1">
</div>
<ul>
<li>Replaced the data structure used to capture latency with HdrHistogram: that is just generally useful and gives us better data to work with when examining the corrected measurement</li>
<li>Found scheduling code and introduced notion of operation start time.</li>
<li>Found measuring code and captured both operation cost (uncorrected measurement) and scheduled time latency (corrected measurement).</li>
<li>Use a mock system under test to evaluate measurement of a known scenario. This is a very handy thing to have and luckily YCSB had this facility in place. In other places you may have to implement this yourself, but it's a valuable tool to have in order to better understand the measurement capabilities of your harness. This helped highlight the scale of scheduling inaccuracies and test harness overhead per operation, as well as the scale of test harness error during its own warmup period.</li>
<li>Use HdrHistogram facilities to visualise and analyse latency histogram data from the compressed histogram logs.</li>
</ul>
Thanks go to this post's kind reviewers: <a href="https://twitter.com/peterhughesdev" target="_blank">Peter Hughes</a>, <a href="http://twitter.com/darachennis" target="_blank">Darach</a>, and <a href="https://twitter.com/philip_aston" target="_blank">Philip Aston</a><br />
<div class="p1">
<br /></div>
<div class="p1">
<br /></div>
</div>
</div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com0tag:blogger.com,1999:blog-5171098727364395242.post-42962202337231968052015-02-16T10:28:00.000+00:002015-02-16T10:28:57.851+00:00HdrHistogram: A better latency capture methodSome years back I was working on a latency sensitive application, and since latency was sensitive it was a requirement that we somehow capture latency both on a transaction/event level and in summary form. The event level latency was post processed from the audit logs we had to produce in any case, but the summary form was used for live system monitoring. We ended up with a solution I've seen since in many places (mild variations on a theme), which is what I've come to think of as the linear buckets histogram:<br />
<script src="https://gist.github.com/nitsanw/3e6c59e085b659afb7c3.js"></script><br />
The above data structure was an easy solution to a problem we had little time to solve, but it left much to be desired.<br />
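<div>
In case the embedded gist doesn't render for you, the pattern described above looks roughly like this (an illustrative reconstruction of the general idea, not the original code):</div>
<pre>
// A typical hand-rolled linear-buckets histogram: fixed bucket size, fixed range.
public class LinearHistogram {
  private final long[] buckets;
  private final long bucketSizeUs;
  private long overflowCount; // values beyond the expected maximum land here

  public LinearHistogram(long maxExpectedUs, long bucketSizeUs) {
    this.bucketSizeUs = bucketSizeUs;
    this.buckets = new long[(int) (maxExpectedUs / bucketSizeUs) + 1];
  }

  public void record(long valueUs) {
    int index = (int) (valueUs / bucketSizeUs);
    if (index < buckets.length) {
      buckets[index]++;
    } else {
      overflowCount++;
    }
  }

  public long countBetween(long fromUs, long toUs) {
    long count = 0;
    for (int i = (int) (fromUs / bucketSizeUs); i <= toUs / bucketSizeUs && i < buckets.length; i++) {
      count += buckets[i];
    }
    return count;
  }
}
</pre>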
These days the histogram problem is gloriously solved by the <a href="https://github.com/HdrHistogram/HdrHistogram" target="_blank">HdrHistogram</a> (High Dynamic Range), and though it's been around for a couple of years now I still find olden hand rolled histograms in many a code base. Cut that shit out boys and girls! Let me show you a better way :-)<br />
<br />
HdrHistogram highlights:<br />
<ul>
<li>Mature and battle tested, this data structure has been in production for many companies for a while now. Kinks have been unkinked and knickers untwisted.</li>
<li>Multi-lingual support: current implementations available in <a href="https://github.com/HdrHistogram/HdrHistogram" target="_blank">Java</a>, <a href="https://github.com/HdrHistogram/HdrHistogram_c" target="_blank">C</a>, <a href="https://github.com/HdrHistogram/HdrHistogram" target="_blank">C#</a>, <a href="https://github.com/HdrHistogram/hdr_histogram_erl" target="_blank">Erlang</a>, Go and more are on the way.</li>
<li>Auto-resizing histograms (if you exceed your initial maximum value estimate)</li>
<li>Compact memory footprint supporting high precision of values across a wide range.</li>
<li>Compressed lossless serialization/de-serialization</li>
<li>Plotting scripts for gnuplot, a webby charting tool and an excel sheet chart</li>
<li>Lock-free concurrency support for recording and logging from multiple threads</li>
<li>Zero allocation on recording path (unless resizing which is optional, and then only if value exceeds initially specified max recorded value)</li>
<li>Constant time measurement which costs less than a call to System.nanoTime() (on the cost, scalability and trustworthiness of nanoTime <a href="http://shipilev.net/blog/2014/nanotrusting-nanotime/" target="_blank">read Shipilev's excellent report</a>)</li>
</ul>
It is truly as if the bees-knees and the dogs-bollocks had a baby, is it not?<br />
<br />
<h3>
Mama, what's a histogram?</h3>
<div>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgsJnVXrdWMPO3XlYxw_ZG4CRQLzLKEGk8dyN9Q-ytejlFBHN2V337GqB7dBB96NZc6dItljd2TzwR6a2bQVl4ZPC0hE6seyXb_Lkaf5585r4rN6Rj52SCW8mMGnHYQwJFvvmeTWSTvdE/s1600/500px-Histogram_of_arrivals_per_minute.svg.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgsJnVXrdWMPO3XlYxw_ZG4CRQLzLKEGk8dyN9Q-ytejlFBHN2V337GqB7dBB96NZc6dItljd2TzwR6a2bQVl4ZPC0hE6seyXb_Lkaf5585r4rN6Rj52SCW8mMGnHYQwJFvvmeTWSTvdE/s1600/500px-Histogram_of_arrivals_per_minute.svg.png" height="168" title="" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: xx-small;">"Histogram of arrivals per minute" by DanielPenfield<br />Own work. Licensed under CC BY-SA 3.0 via Wikimedia Common</span></td></tr>
</tbody></table>
Well, you could look it up on <a href="http://en.wikipedia.org/wiki/Histogram" target="_blank">wikipedia</a>, but the short answer is that a histogram is a summary of data in terms of the frequency of ranges of values. So given the following data set [1,2,3,...,100000] captured for a second I could summarize in several ways:</div>
<div>
<ul>
<li>I could capture the whole range of values with each value in a bucket of its own, assigning each value a frequency of 1 per second.</li>
<li>I could have a single bucket for values between 0 and 100,000; this bucket will have a frequency of 100,000 per second.</li>
</ul>
These are both a bit silly: the first is as bad as dealing with the full set of data, the second tells us nothing about the way the 100,000 values break down within the range. Still, these are the 2 extremes of histograms; the alternatives lie between them in terms of the data they offer, and there are many ways to skin a cat (apparently, :( poor cats):</div>
<div>
<ul>
<li>Capture the values in 1000 buckets, each bucket representing a range of 100 values: [1..100][101..200]...[99,901..100,000] that will result in 1,000 buckets each with a frequency of 100. This is the sort of histogram described above where all buckets capture the same fixed range.</li>
<li>Capture the values in 17 buckets, each bucket K representing a range [2^K..(2^(K+1)-1)] e.g. [1..1][2..3][4..7]...[65,536..131,071]. This would be a good solution if we thought most values are likely to be small and so wanted higher precision on the lower range, with lower and lower precision for larger values. Note that we don't have to use 2 as the base for exponential histogram, other values work as well.</li>
</ul>
Both of the above solutions trade precision across a large range with storage space. In both solutions I am required to choose up-front the histogram precision and we expect to pay for a large range with either space or precision. Now that we realize what the variables are we can describe these solutions:</div>
<div>
<ul>
<li>Linear buckets: For a range 0..R we will have to pay R/B space for buckets of range B. The higher R is the more space we require, we can compensate by picking a large B.</li>
<li>Exponential buckets: For a range 0..R we require space of log2 of R. The bucket size grows exponentially as we track higher values (see the indexing sketch after this list).</li>
</ul>
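<div>
The exponential flavour is cheap to index into; assuming power-of-2 buckets, the bucket for a value is just its bit length (a small illustrative snippet):</div>
<pre>
// Value v lands in bucket K such that 2^K <= v < 2^(K+1), e.g. 65,536 lands in bucket 16.
static int exponentialBucketIndex(long v) {
  return 63 - Long.numberOfLeadingZeros(v | 1); // '| 1' keeps the index defined for 0
}
</pre>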
The problem we face with latency data points is that the range of values we want to capture is rather large. It is not unreasonable to expect the latency outliers to be several orders of magnitude larger than the typical observed measurement. For example, it may well be that we are timing a method whose cost is in the 100s of nanoseconds, or a high speed connection round trip on the order of 1-100µs but on occasion our latency is dominated by some stop the world GC pause, or network congestion, which is in the order of 10ms to a few seconds. How can we correctly size the range of our histogram? Given the possibility of multi-second GC delays we need to cover a typical range of 1000ns to 100,000,000,000ns. If we used a linear histogram with a 100µs bucket we'd need 1,000,000 buckets (assuming an int counter this will add up to a ~4MB data structure).</div>
<div>
The HdrHistogram follows a different path to the above solutions and manages to accommodate a large range with a high precision across the range in a limited space.</div>
<div>
<br /></div>
<h3>
How does it work?</h3>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsb5-a_7z1wHnuf4p33BHnHjYaIcEk2DrbM857rSyQ38W749WBxd-e2A2PDJ9idIBNfy7mgrl3EM6WgcHZ4q5rIM8azaTESibplbU5kWy-2tiZvEY8X-yPCgfTZYdYzyFBlbkzRIzXPFA/s1600/Snowman+evoloution.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsb5-a_7z1wHnuf4p33BHnHjYaIcEk2DrbM857rSyQ38W749WBxd-e2A2PDJ9idIBNfy7mgrl3EM6WgcHZ4q5rIM8azaTESibplbU5kWy-2tiZvEY8X-yPCgfTZYdYzyFBlbkzRIzXPFA/s1600/Snowman+evoloution.png" /></a></div>
Here's what the documentation has to say:</div>
<blockquote class="tr_bq">
"<span style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 25.6000003814697px;">Internally, data in HdrHistogram variants is maintained using a concept somewhat similar to that of floating point number representation: Using an exponent a (non-normalized) mantissa to support a wide dynamic range at a high but varying (by exponent value) resolution. AbstractHistogram uses exponentially increasing bucket value ranges (the parallel of the exponent portion of a floating point number) with each bucket containing a fixed number (per bucket) set of linear sub-buckets (the parallel of a non-normalized mantissa portion of a floating point number). Both dynamic range and resolution are configurable, with highestTrackableValue controlling dynamic range, and numberOfSignificantValueDigits controlling resolution."</span></blockquote>
<div>
Hmmm... I'll admit to having difficulty immediately understanding what's happening from the above text, precise though it may be. I had to step through the code to get my head around what works and why, read the above again and let it simmer. I'm not going to dig into the implementation because, while interesting, it is not the point of this post. I leave it to the reader to pester <a href="https://twitter.com/giltene" target="_blank">Gil Tene</a> (the author of <a href="https://github.com/HdrHistogram/HdrHistogram" target="_blank">HdrHistogram</a>) with implementation related questions.</div>
<div>
The principal idea is a mix of the exponential and linear histograms: support a wide <a href="http://en.wikipedia.org/wiki/Dynamic_range" target="_blank">dynamic range</a> with a precision that is appropriate to the scale of the measured value. At the scale of seconds we have a precision of milliseconds, at the scale of milliseconds we have a precision of microseconds, etc. This translates roughly into exponential scale buckets, each of which has linear sub-buckets.</div>
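<div>
To give a feel for the idea, here is a toy version of "exponential buckets with linear sub-buckets" (a simplified illustration of the concept, not HdrHistogram's actual indexing code):</div>
<pre style="font-family: 'Courier New', Courier, monospace;">
// Toy illustration of exponential buckets with linear sub-buckets (NOT HdrHistogram's real code).
// Each power-of-2 range is split into 1024 equal slices, so the relative precision stays roughly
// constant (~3 significant decimal digits) across the whole range. Assumes value >= 0.
class ToyIndexing {
    static final int SUB_BUCKETS = 1024; // 2^10

    static int countsIndex(long value) {
        if (value < SUB_BUCKETS) {
            return (int) value; // small values get exact resolution
        }
        int bucket = 63 - Long.numberOfLeadingZeros(value);            // the "exponent"
        int subBucket = (int) (value >>> (bucket - 10)) - SUB_BUCKETS; // the "mantissa": 0..1023
        return (bucket - 9) * SUB_BUCKETS + subBucket;                 // flat index into a counts array
    }
}
</pre>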
<div>
<br /></div>
<h3>
Example: From raw recording to histogram</h3>
<div>
A while back I posted a Java ping utility which measures the round trip between minimal client and server processes. Each round trip was recorded into a large array, and every set number of round trips the measurements were summarized in percentiles:</div>
<div>
<script src="https://gist.github.com/nitsanw/48269a431cc0600c795f.js"></script></div>
<div>
Recording raw data is the simplest way of capturing latency, but it comes at a price. The long[] used to capture the latencies is ~8MB in size for a million samples, and in a real application it can grow without bounds until some cutoff point where we decide to summarize or discard the data. When we want to report percentiles we have to sort it and pick the relevant data points. This is not usually an acceptable solution (because of the memory footprint), but it offers absolute accuracy and is trivial to implement (until you have to consider serialization, concurrency and visualization, but otherwise trivial).</div>
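<div>
The essence of the raw approach boils down to this (a stripped down sketch of the idea, not the gist above verbatim):</div>
<pre style="font-family: 'Courier New', Courier, monospace;">
import java.util.Arrays;

// Raw recording: keep every sample, sort on demand, pick the nearest-rank percentile.
class RawLatencyRecorder {
    final long[] samples = new long[1_000_000]; // ~8MB, and it only grows from here
    int count;

    void record(long latencyNs) {
        samples[count++] = latencyNs; // real code must deal with running out of space
    }

    long valueAtPercentile(double percentile) {
        Arrays.sort(samples, 0, count); // O(n log n) at report time
        int index = (int) Math.ceil((percentile / 100.0) * count) - 1;
        return samples[Math.max(index, 0)];
    }
}
</pre>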
<div>
Replacing this measurement method with a histogram is straightforward:</div>
<div>
<script src="https://gist.github.com/nitsanw/b01d2cfca5b9ddf678c9.js"></script></div>
<div>
This histogram is 31KB when using 2 decimal places of precision, which is good enough in most cases (size measured with <a href="http://openjdk.java.net/projects/code-tools/jol/" target="_blank">JOL</a>, add it to your utility belt if it ain't there already; increasing the precision to 3 digits increases the size to 220KB). Either way it is a large improvement over 8MB. We could reduce the memory consumption further if we were willing to limit the maximum data point count per bucket and use an integer/short histogram (ints seem like a reasonable choice).</div>
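<div>
For reference, the histogram flavour of the above is only a few lines. This is a minimal sketch of the HdrHistogram API; the gist above is the real code:</div>
<pre style="font-family: 'Courier New', Courier, monospace;">
import org.HdrHistogram.Histogram;

// Fixed footprint recording: highest trackable value of 1 hour in ns, 2 significant digits.
class HdrLatencyRecorder {
    final Histogram histogram = new Histogram(3_600_000_000_000L, 2);

    void record(long latencyNs) {
        histogram.recordValue(latencyNs); // no per-sample allocation
    }

    void report() {
        System.out.println("99.99%ile: " + histogram.getValueAtPercentile(99.99) + "ns");
        histogram.outputPercentileDistribution(System.out, 1000.0); // report values scaled to µs
    }
}
</pre>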
<div>
If we print both measurement methods for the same run we can see the difference between the raw data and the HDR representation, which is naturally slightly less accurate (# lines are <a href="https://github.com/HdrHistogram/HdrHistogram" target="_blank">HdrHistogram</a>, @ is for raw data, each line represents 1M data points):</div>
<div>
<script src="https://gist.github.com/nitsanw/73755b6ff4a3c625452f.js"></script></div>
<div>
We can see that reported percentiles are pretty close to the raw data:<br />
<br />
<ul>
<li>Note that nanoTime on Mac reports in µs granularity, which is why the real values (@ lines) all end with 3 zeros.</li>
<li>Note that the max/min reported are adjusted to the correct histogram resolution (not a big deal, but slightly surprising).</li>
</ul>
<br />
What would have happened with our hand rolled solution? To keep a range of 0 to 60,000,000,000ns in a linear histogram of the same memory footprint we would need to limit ourselves to roughly 220KB/4B≈55K buckets (assuming int counters). Each bucket would have a granularity of roughly 1ms, which would have translated to very limited visibility on the lower end of the spectrum, as most data sets are actually all under 1ms. This would have also completely skewed our percentiles (i.e. 99.99% of results land in the first 1ms bucket, so there is no breakdown of behaviour in the percentiles). We could try to tackle the issue by picking a lower range to cover (which, if applied to <a href="https://github.com/HdrHistogram/HdrHistogram" target="_blank">HdrHistogram</a>, will minimize memory usage further) or by blowing the memory budget on finer grained buckets.<br />
<br />
<h3>
Example: Aeron samples, much percentiles! Such graphs! Wow!</h3>
</div>
<div>
<a href="http://en.wikipedia.org/wiki/Percentile" target="_blank">Percentiles</a> are commonly used to describe latency <a href="http://en.wikipedia.org/wiki/Service-level_agreement" target="_blank">SLA</a> of a given system and a typical application will report a range of percentiles to reflect the probability of a given response time. In this context we can say that a 99%ile latency of 1ms means that 99% of all requests were handled in =< 1ms.<br />
<a href="https://github.com/real-logic/Aeron" target="_blank">Aeron</a> is a low latency, reliable UDP messaging library. Since latency is an obvious concern, all the samples utilize the <a href="https://github.com/HdrHistogram/HdrHistogram" target="_blank">HdrHistogram</a> lib to demonstrate measurement and report results, here are the relevant excerpts from the <a href="https://github.com/real-logic/Aeron/blob/master/aeron-samples/src/main/java/uk/co/real_logic/aeron/samples/Ping.java" target="_blank">Ping sample</a>:<br />
<script src="https://gist.github.com/nitsanw/72ec51b89e0939d4faa9.js"></script>
This results in a pile of text getting poured into the console, not that friendly:
<script src="https://gist.github.com/nitsanw/3d6a15cf7a1554e68109.js"></script>
<br />
Fear not, <a href="https://github.com/HdrHistogram/HdrHistogram" target="_blank">HdrHistogram</a> comes packed with a handy charty thingy! Here's what the above histogram looks like when plotted:<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieykWJotCR1THMECpCVrEmUleVyb3ilj_16kP4NFs5VN1wDl9-2R8cDiGaxE__nIcTkasj4hx1fKudOotbbdn4C9MeYA9lJVx3ogBj3G_EdbaR1nK_DW3wsMN-2DxJvh1o0kX9h1nm-Vc/s1600/Histogram+(1).png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieykWJotCR1THMECpCVrEmUleVyb3ilj_16kP4NFs5VN1wDl9-2R8cDiGaxE__nIcTkasj4hx1fKudOotbbdn4C9MeYA9lJVx3ogBj3G_EdbaR1nK_DW3wsMN-2DxJvh1o0kX9h1nm-Vc/s1600/Histogram+(1).png" height="244" width="640" /></a></div>
<br />
To get this graph:<br />
<ul>
<li>Save output above to a text file</li>
<li>Open a browser, and go <a href="http://hdrhistogram.github.io/HdrHistogram/plotFiles.html" target="_blank">here</a> (the same HTML is in the project <a href="https://github.com/HdrHistogram/HdrHistogram/blob/master/GoogleChartsExample/plotFiles.html" target="_blank">here</a>)</li>
<li>Choose your file, choose percentiles to report and unit to report in</li>
<li>Export the picture and stick it in your blog!</li>
</ul>
This histogram was provided to me by Martin Thompson (one of the Aeron developers) and is from a test run in a performance lab. We can see that Aeron is delivering a solid 7µs RTT up to the 90%ile, where latency starts to gradually grow. In this particular data set the maximum observed latency was 38µs. This is a great latency profile. It is far more common for the max and 99.99%ile to be orders of magnitude more than the 90%ile.<br />
I could similarly plot this histogram using a gnuplot script to be found <a href="https://github.com/HdrHistogram/HdrHistogram/tree/master/gnuplotExample" target="_blank">here</a>. The gnuplot script is very handy for scripted reporting on large runs. It also allows for plotting several histograms on the same graph to allow visual comparison between benchmark runs for instance.<br />
<br />
<h3>
Example: Compressed histogram logging</h3>
Because the SLA is often specified in percentiles, it is common for applications to log only percentiles and not histograms. This leads to a reporting problem as <a href="http://latencytipoftheday.blogspot.com/2014/06/latencytipoftheday-you-cant-average.html" target="_blank">it turns out that percentile outputs cannot be combined to produce meaningful average percentiles.</a> The solution would be to log the full histogram data, but who wants a log that grows by 31KB every 5 seconds just to capture one histogram? Worry not, <a href="https://github.com/HdrHistogram/HdrHistogram" target="_blank">HdrHistogram</a> comes with a compressed logging format and log writer and all that good stuff:</div>
<div>
<script src="https://gist.github.com/nitsanw/7c657672e126a937b466.js"></script></div>
<div>
How many bytes is that logged histogram costing us? The 31KB histogram compressed down to 1KB in my example, but the result of compression will depend on the histogram captured. It is fair to assume that histograms compress well, as the array of buckets is full of zeros (on the byte level) since most buckets are empty or have low counts.<br />
If 1KB sounds like a lot, consider that a day's worth of 10s interval histograms will result in an 8MB log file, which seems pretty acceptable even if you have a hundred such files. The benefit is that you will now have high precision interval latency data that you can reliably use to create longer interval latency data. You can use the <a href="https://github.com/HdrHistogram/HdrHistogram/blob/master/src/main/java/org/HdrHistogram/HistogramLogProcessor.java" target="_blank">HistogramLogProcessor</a> to produce a full or partial log summary histogram for plotting as above.<br />
I believe there are some truly exciting data visualizations one could build on top of this data, but sadly that's not where my skills lie. If you've got skills to show off in this area I'm sure <a href="https://github.com/HdrHistogram/HdrHistogram" target="_blank">HdrHistogram</a> would value your contribution.<br />
<br />
<h3>
Example: jHiccup and concurrent logging</h3>
</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOtodJcWE-e1lA8AHaTHtqgJT_v8uxVYWeuGvYUYnLCQ8sIyLk7SoM8_8qE0OUmA2GBnHXlDDnOJgWont-gGa73qPNFKw8E-Ku2yma6m4PmUzgfWrHGJxZ43GlUCXtMsb373M2XmbJoUs/s1600/oh-damn-it_o_192211.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOtodJcWE-e1lA8AHaTHtqgJT_v8uxVYWeuGvYUYnLCQ8sIyLk7SoM8_8qE0OUmA2GBnHXlDDnOJgWont-gGa73qPNFKw8E-Ku2yma6m4PmUzgfWrHGJxZ43GlUCXtMsb373M2XmbJoUs/s1600/oh-damn-it_o_192211.jpg" height="155" width="200" /></a></div>
<a href="https://github.com/giltene/jHiccup" target="_blank">jHiccup</a> is a pause measurement tool used to capture OS or JVM 'hiccups'. It deserves it's own post but I'll try and summarize it in a few points:<br />
<ul>
<li>jHiccup runs a HiccupRecorder thread which sleeps for a period of time (configurable) and measures the delta between the wakeup time and actual time. The failure to be re-scheduled is taken as a potential OS/JVM hiccup. The size of the hiccup is recorded in a histogram.</li>
<li>jHiccup can be run as an agent in your own process, an external process, or both.</li>
<li>jHiccup has been ported to C as well.</li>
<li>People typically use jHiccup to help characterize and diagnose disruptions to execution in their system. While not every hiccup is the result of a STW pause, we can correlate the jHiccup agent evidence with an external jHiccup process and the JVM GC logs to support root cause analysis. A significant hiccup is a serious sign of trouble: it means a thread was denied scheduling for the length of the hiccup. We can safely assume in most cases that this is a sign that other threads were similarly disturbed.</li>
</ul>
Gil Tene originally wrote <a href="https://github.com/HdrHistogram/HdrHistogram" target="_blank">HdrHistogram</a> as part of jHiccup, but as <a href="https://github.com/HdrHistogram/HdrHistogram" target="_blank">HdrHistogram</a> turned out to be more generally useful the two were split. The reason I bring jHiccup up in this context is that it serves as a regularly maintained full blown real world example of using an <a href="https://github.com/HdrHistogram/HdrHistogram" target="_blank">HdrHistogram</a>.<br />
jHiccup has 2 interesting threads, with roles that parallel many real world applications out there:<br />
<ul>
<li>The measuring thread (<a href="https://github.com/giltene/jHiccup/blob/master/src/main/java/org/jhiccup/HiccupMeter.java#L408" target="_blank">HiccupRecorder</a>): This is the thread that sleeps and wakes up and so on. The rate at which it does that is potentially quite high and we don't want to skew the measurement by performing IO on this thread. Similarly, many real world applications will have critical threads where it is not desirable to introduce IO. Since this is the case, actual persistence will be performed on another thread.</li>
<li>The monitoring/logging thread (<a href="https://github.com/giltene/jHiccup/blob/master/src/main/java/org/jhiccup/HiccupMeter.java#L133" target="_blank">HiccupMeter</a>): This thread will wake up at regular intervals and write the last interval's histogram to the log file. But since it is reading a histogram while another thread is writing to it, we now need to manage concurrency.</li>
</ul>
<a href="https://github.com/HdrHistogram/HdrHistogram" target="_blank">HdrHistogram</a> offers a synchronization facility to serve exactly this use case in the form of the <a href="https://github.com/HdrHistogram/HdrHistogram/blob/master/src/main/java/org/HdrHistogram/Recorder.java" target="_blank">Recorder</a>:<br />
<ul>
<li>Recording a value in the recorder is a wait free operation (on JDK8; it can be lock free on older JDKs, depending on the getAndAdd implementation for AtomicLongArray).</li>
<li>The Recorder also comes in a <a href="https://github.com/HdrHistogram/HdrHistogram/blob/master/src/main/java/org/HdrHistogram/SingleWriterRecorder.java" target="_blank">single-writer flavour</a>, which minimizes the concurrency related overheads.</li>
</ul>
<br />
Under the covers the recorder is using an active histogram and an inactive one, swapped seamlessly when an interval histogram is requested. Using a Recorder looks much like using a normal histogram (full code <a href="https://github.com/giltene/jHiccup/blob/master/src/main/java/org/jhiccup/HiccupMeter.java#L133" target="_blank">here</a>):<br />
<script src="https://gist.github.com/nitsanw/e1b87744b6371c8d357c.js"></script>
And that's concurrent logging sorted ;-).<br />
<br /></div>
<div>
<h3>
Summary</h3>
</div>
<div>
With <a href="https://github.com/HdrHistogram/HdrHistogram" target="_blank">HdrHistogram</a> now hitting version 2.1.4 and offering a wealth of tried and tested functionality along with cross platform implementations and a standardized compressed logging format it is definitely time you gave it a go! May your latencies always be low!<br />
If you are looking for a pet project and have a gift for UI thingies a latency explorer for the histogram interval logs would be an awesome contribution!</div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com10tag:blogger.com,1999:blog-5171098727364395242.post-17794598050368091712015-01-19T12:38:00.000+00:002015-02-21T11:17:17.252+00:00MPMC: The Multi Multi Queue vs. CLQ<b><span style="font-size: x-small;">{This post is part of a long running series on lock free queues, <a href="http://psy-lob-saw.blogspot.com/p/lock-free-queues.html" target="_blank">checkout the full index to get more context here</a>}</span></b><br />
<a href="https://github.com/JCTools/JCTools" target="_blank">JCTools</a>, which is my spandex side project for lock-free queues and other animals, contains a lovely gem of a queue called the <a href="https://github.com/JCTools/JCTools/blob/2a30aa3ecd2bb50fc866b78668b633bb7a0d62ab/jctools-core/src/main/java/org/jctools/queues/MpmcArrayQueue.java" target="_blank">MpmcArrayQueue</a>. It is a port of <a href="http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue" target="_blank">an algorithm</a> put forward by <a href="http://www.1024cores.net/home/about-me" target="_blank">D. Vyukov</a> (the lock free ninja) which I briefly discussed in a <a href="http://psy-lob-saw.blogspot.com/2014/04/notes-on-concurrent-ring-buffer-queue.html" target="_blank">previous post</a> on ring-buffer queues.<br />
The <a href="https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/MpmcArrayQueue.java" target="_blank">implementation</a> is covered to a large extent in that post, but no performance numbers are provided to evaluate this queue vs. the JDK's own MPMC queue the illustrious ConcurentLinkedQueue (CLQ for short). Let's fix that!<br />
<br />
<h3>
Welcome to the Machine</h3>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjr1NXx2ItcWPQURiRXncdYntdL5FJtmwmx-8TP-1xisGxfBo23HtLVTnsFnFYeME3DNwuSwdPp9iHeRStlhlxp7ckLvP9RajFkoZmgT2vk27eCGCgr9SD0dCeJdtbB9qmfdmpPRlZJfsw/s1600/Bjork-Quiet_l.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjr1NXx2ItcWPQURiRXncdYntdL5FJtmwmx-8TP-1xisGxfBo23HtLVTnsFnFYeME3DNwuSwdPp9iHeRStlhlxp7ckLvP9RajFkoZmgT2vk27eCGCgr9SD0dCeJdtbB9qmfdmpPRlZJfsw/s1600/Bjork-Quiet_l.jpg" height="150" width="200" /></a></div>
Before we start the party, we must establish the test bed for these experiments. The results will only be relevant for similar setups and will become less indicative the more we deviate from this environment. So here it is, the quiet performance testing machine (shitloads of power and nothing else running):</div>
<div>
<ul>
<li>OS: CentOS 6.4</li>
<li>CPU: Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz, dual socket, each CPU has 12 cores (24 logical cores as HT is on). This is an Ivy Bridge, so decent hardware if not the fanciest/latest.</li>
<li>JVM: Oracle JDK 1.8u25</li>
<li>Load average (idle): 0.00, 0.02, 0.00 -> <a href="https://www.youtube.com/watch?v=TEC4nZ-yga8" target="_blank">it's oh so quiet</a>....</li>
</ul>
<div>
I'll be using <i><span style="font-family: Courier New, Courier, monospace;">taskset</span></i> to pin threads to a given topology of producers/consumers which is important because:<br />
<ul>
<li>Running 2 logical threads on the same physical core will share the queue in the L1 cache.</li>
<li>Running on 2 physical cores will share the queue in the LLC.</li>
<li>Running on different sockets (i.e. 2 different CPUs) will force sharing data via QPI.</li>
<li>Each topology has its own profile; there's no point in averaging the lot together.</li>
</ul>
For the sake of these benchmarks I'll be pinning the producer/consumer threads to never share a physical core, but remain on the same CPU socket. In my particular hardware this means I'll use the 12 physical cores on one CPU using <i><span style="font-family: Courier New, Courier, monospace;">taskset -c 0-11.</span></i></div>
</div>
<div>
For the sake of keeping things simple I ran the benchmarks several times and averaged the results, no fancy graphs (still recovering from the new year). Throughput results are quoted in ops per microsecond, if you are more comfortable with ops per second just multiply by 1 million. Latency results are in nanoseconds per operation.<br />
<br /></div>
<h3>
Throughput Benchmarks: Hand rolled 1 to 1</h3>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYARwjddSfkuYpZmIlhJrv8UdHmAU49irkgOg7_dxzRUDw7xyBR43TDEOxH1S_LVi9OtNDyyu_0LYyh1NLBEBoj1EKG2PJvM0yM46YzQNzRm4tQw4rSmq1PtNEDwQ8EmCDOCr6CEAgDk8/s1600/handrolled.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYARwjddSfkuYpZmIlhJrv8UdHmAU49irkgOg7_dxzRUDw7xyBR43TDEOxH1S_LVi9OtNDyyu_0LYyh1NLBEBoj1EKG2PJvM0yM46YzQNzRm4tQw4rSmq1PtNEDwQ8EmCDOCr6CEAgDk8/s1600/handrolled.png" height="125" width="320" /></a></div>
Let's start with the simplest benchmark in JCTools. The <span style="font-family: Courier New, Courier, monospace;"><a href="https://github.com/JCTools/JCTools/blob/master/jctools-benchmarks/src/main/java/org/jctools/handrolled/throughput/spsc/QueuePerfTest.java" target="_blank">QueuePerfTest</a></span><span style="font-family: inherit;"> is really a single producer/consumer test:</span></div>
<div>
<ul>
<li><span style="font-family: inherit;">A producer thread is spun up which feeds into the queue as fast as it can.</span></li>
<li><span style="font-family: inherit;">The main thread plays the consumer thread and empties the queue as fast as it can. </span></li>
</ul>
<span style="font-family: inherit;">The original benchmark was a sample used by Martin Thompson in one of his talks and I tweaked it further. It's nice and simple and has the added benefit of verifying that all the elements arrived at their destination. It also opens the door to loop unrolling which you may or may not consider a reasonable use case. This benchmark calls Thread.yield() if an offer/poll call fails which represents a rather forgiving backoff policy, which is why we also have the </span><span style="font-family: Courier New, Courier, monospace;"><a href="https://github.com/JCTools/JCTools/blob/master/jctools-benchmarks/src/main/java/org/jctools/handrolled/throughput/spsc/BusyQueuePerfTest.java" target="_blank">BusyQueuePerfTest</a></span><span style="font-family: inherit;"> which busy spins on fail.</span></div>
<div>
<span style="font-family: inherit;">Here's what we get:</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">Queue Benchmark Throughput(ops/µs)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">CLQ <a href="https://github.com/JCTools/JCTools/blob/master/jctools-benchmarks/src/main/java/org/jctools/handrolled/throughput/spsc/QueuePerfTest.java" target="_blank">QueuePerfTest</a> 10.7</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">Mpmc </span><a href="https://github.com/JCTools/JCTools/blob/master/jctools-benchmarks/src/main/java/org/jctools/handrolled/throughput/spsc/QueuePerfTest.java" style="font-family: 'Courier New', Courier, monospace;" target="_blank">QueuePerfTest</a><span style="font-family: Courier New, Courier, monospace;"> 65.1</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">CLQ </span><a href="https://github.com/JCTools/JCTools/blob/master/jctools-benchmarks/src/main/java/org/jctools/handrolled/throughput/spsc/BusyQueuePerfTest.java" style="font-family: 'Courier New', Courier, monospace;" target="_blank">BusyQueuePerfTest</a><span style="font-family: Courier New, Courier, monospace;"> 15.3</span></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;">Mpmc </span><span style="font-family: Courier New, Courier, monospace;"> </span><a href="https://github.com/JCTools/JCTools/blob/master/jctools-benchmarks/src/main/java/org/jctools/handrolled/throughput/spsc/BusyQueuePerfTest.java" style="font-family: 'Courier New', Courier, monospace;" target="_blank">BusyQueuePerfTest</a><span style="font-family: Courier New, Courier, monospace;"> 63.5</span></div>
<div>
</div>
<div>
The observed throughput for CLQ varies quite a bit between iterations, but the full run average is not too unstable. The performance also depends on the size of the heap, as CLQ allocates a node per item passed and the GC overhead can become a significant part of the overall cost. In my runs I set the heap size to 1GB and the benchmark also calls System.gc() between measurement iterations. In any case, off to a good start, the JCTools queue is looking well.</div>
<div>
<br /></div>
<h3>
Throughput benchmarks: JMH Joy</h3>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjc4uuMkOUcrljklRsVQrs2FoCcPXXcB-oECCz-o3NiXPRtdam7qzxGxEsT-tnuBIbyyUY8PvODUsdsOuogg_9h3tcwD13sddQpZlx-qMLkLyptAGM04g5LcmBNSoGrx5JzWcypDfKg8EY/s1600/lastingjoy2.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjc4uuMkOUcrljklRsVQrs2FoCcPXXcB-oECCz-o3NiXPRtdam7qzxGxEsT-tnuBIbyyUY8PvODUsdsOuogg_9h3tcwD13sddQpZlx-qMLkLyptAGM04g5LcmBNSoGrx5JzWcypDfKg8EY/s1600/lastingjoy2.jpg" height="200" width="200" /></a></div>
I'm a big JMH fan (see the <a href="http://psy-lob-saw.blogspot.com/p/jmh-related-posts.html" target="_blank">JMH reference page</a>), and <a href="https://github.com/JCTools/JCTools/tree/master/jctools-benchmarks/src/main/java/org/jctools/jmh/throughput" target="_blank">these benchmarks</a> show some of the power JMH has. The benchmark code is very simple and it's clear what's being measured:</div>
<div>
<ul>
<li>We have 2 methods, offer and poll</li>
<li>Thread groups hammer each method</li>
<li>We use <span class="s1">@</span>AuxCounters to report successful/failed interactions with the queue</li>
<li>The throughput is the number of pollsMade, i.e. the number of items delivered (a stripped down sketch of this setup follows below)</li>
</ul>
</div>
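<div>
Here is a stripped down sketch of such a setup (simplified from the linked benchmark; the counter handling here is approximate):</div>
<pre style="font-family: 'Courier New', Courier, monospace;">
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import org.openjdk.jmh.annotations.*;

@State(Scope.Group)
public class QueueThroughputSketch {
    final Queue<Integer> q = new ConcurrentLinkedQueue<>();

    @AuxCounters
    @State(Scope.Thread)
    public static class PollCounters {
        public long pollsMade;   // successful polls, i.e. items delivered
        public long pollsFailed;
    }

    @Benchmark
    @Group("tpt")
    @GroupThreads(1) // use -tg to scale the number of producers
    public void offer() {
        q.offer(1);
    }

    @Benchmark
    @Group("tpt")
    @GroupThreads(1) // use -tg to scale the number of consumers
    public void poll(PollCounters counters) {
        if (q.poll() == null) {
            counters.pollsFailed++;
        } else {
            counters.pollsMade++;
        }
    }
}
</pre>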
<div>
The nice thing is that once I got this setup I can now play with the number of threads in each group with no further investment (see an <a href="http://psy-lob-saw.blogspot.com/2013/05/using-jmh-to-benchmark-multi-threaded.html" target="_blank">introduction to multi-threaded benchmarks with JMH</a>). I can run several iterations/forks and go make some coffee while JMH crunches through the lot and gives me a nice report (JMH options used for these benchmarks: "<span style="font-family: Courier New, Courier, monospace;">-gc -f 5 -i 5 -wi 5 -p qType=MpmcArrayQueue,ConcurrentLinkedQueue -p qCapacity=1000000</span>" the -gc option forces a GC cycle between iterations as before, I use the -tg option to control number of producer/consumer threads). Here's the results for <a href="https://github.com/JCTools/JCTools/blob/master/jctools-benchmarks/src/main/java/org/jctools/jmh/throughput/QueueThroughputBackoffNone.java" style="font-family: 'Courier New', Courier, monospace;" target="_blank">QueueThroughputBackoffNone</a> (again results are in ops/µs, so higher is better):<br />
<div>
<span style="font-family: Courier New, Courier, monospace;">Queue 1P1C </span><span style="font-family: 'Courier New', Courier, monospace;"> </span><span style="font-family: 'Courier New', Courier, monospace;">2P2C </span><span style="font-family: 'Courier New', Courier, monospace;">3P3C </span><span style="font-family: 'Courier New', Courier, monospace;">4</span><span style="font-family: 'Courier New', Courier, monospace;">P4C 5P5C</span><span style="font-family: 'Courier New', Courier, monospace;"> </span><span style="font-family: 'Courier New', Courier, monospace;"> 6P6C</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">CLQ </span><span style="font-family: 'Courier New', Courier, monospace;"> 7.9 ±1.7 , </span><span style="font-family: Courier New, Courier, monospace;">4.7 ±0.2, 3.8 ±0.1, 3.7 ±0.4, 3.1 ±0.1, 2.9 ±0.1</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">Mpmc </span><span style="font-family: Courier New, Courier, monospace;"> </span><span style="font-family: 'Courier New', Courier, monospace;">68.4 ±11.1, </span><span style="font-family: Courier New, Courier, monospace;">9.2 ±1.1, 7.3 ±0.5, 5.8 ±0.4, 5.2 ±0.4, 5.3 ±0.2</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
Note the columns stand for number of producers/consumers so 1P1C will be running with -tg 1,1 and will have one thread hitting the offer and one the poll. 2P2C will have 2 consumers and 2 producers etc.<br />
So on the one hand, joy: MPMC is still consistently ahead. On the other hand, it is not some sort of magic cure for contention. If you have multiple threads hitting a shared resource you will suffer for it. The initial hit is the worst, but the degradation curve that follows isn't too bad and we seem to stay ahead.<br />
<br />
<h3>
Average Round Trip Time Benchmarks</h3>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyUhYheTt73VPtBhYq-4Zi6eE-ofAij-j1GrM3Pq13tT3XPzPI-sD2kxy9JNVouxgbcDQZOii63ivjjlWWKQ3rjgUk4STFAkQcUXIbb0ugMgKrEpo0aP-BEz7wz5sZ0XzQoBAxQXlsnHs/s1600/ping-pong.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyUhYheTt73VPtBhYq-4Zi6eE-ofAij-j1GrM3Pq13tT3XPzPI-sD2kxy9JNVouxgbcDQZOii63ivjjlWWKQ3rjgUk4STFAkQcUXIbb0ugMgKrEpo0aP-BEz7wz5sZ0XzQoBAxQXlsnHs/s1600/ping-pong.jpg" height="133" width="200" /></a>This benchmark was set up to model a bursty load on queue implementations where we measure the time it takes for a burst of messages to travel from an originating thread, to a chain of threads inter-linked by queues, back to the same thread. This benchmark is focusing on near empty queues and the notification/wakeup time from when an item is placed in a queue until it becomes visible to another thread. It also highlights any batching optimisations impact as bursts grow larger.<br />
The benchmark has been discussed in some detail in 2 previous posts (<a href="http://psy-lob-saw.blogspot.com/2013/12/jaq-spsc-latency-benchmarks1.html" target="_blank">1</a>, <a href="http://psy-lob-saw.blogspot.com/2014/01/jaq-spsc-latency-benchmarks2.html" target="_blank">2</a>).<br />
We have 2 axes we can explore here, burst size and chain length, but I have not explored all the possibilities. The results seem quite straightforward and I'm short on time. I tested with burst size=1,10,100 in the default configuration, i.e. chain length 2 (so a ping-pong kind of setup), and just for the heck of it I ran the burst size=100 with a chain length of 8 (so 1->2->3...->7->8->1). There you go (results are now in nanoseconds per operation, lower is better):<br />
<div>
<span style="font-family: Courier New, Courier, monospace;">Queue b=1,c=2 b=10,c=2 b=100,c=2 b=100,c=8</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">CLQ 488 ±11, 2230 ±98, 15755 ±601, 24886 ±2565</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">Mpmc </span><span style="font-family: 'Courier New', Courier, monospace;"> </span><span style="font-family: 'Courier New', Courier, monospace;">422 ±7, </span><span style="font-family: 'Courier New', Courier, monospace;">1718 ±51, 5144 ±287, </span><span style="font-family: 'Courier New', Courier, monospace;">8714 ± 200</span></div>
<br />
Note that the header shorthand stands for burst size (b) and chain length (c), so b=1,c=2, which is the default config for the benchmark, stands for a burst size of 1 (send 1 message and wait until 1 message is received back) and a chain length of 2 (2 threads linked up in this ring: 1->2->1).<br />
The difference on single element exchange is not that great, but as burst size grows the gap widens significantly. This is perhaps down to the fact that MPMC enjoys a much denser delivery of data, minimising the cache misses experienced as part of the message exchange. Note that extending the length of the chain seems to add the same percentage of overhead for each implementation, resulting in the same ratio for chain length 8 as for chain length 2 (MPMC is 3 times 'cheaper' than CLQ).<br />
<br />
<h3>
Summary</h3>
</div>
<div>
This was all rather dry, but I hope it helps people place the MPMC alternative in context. I would suggest you consider using MpmcArrayQueue in your application as a replacement for CLQ if:</div>
<div>
<ul>
<li>You need a bounded queue</li>
<li>You are concerned about latency</li>
<li>You don't need to use the full scope of methods offered by Queue and can make do with the limited set supported in JCTools queues</li>
</ul>
</div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com4tag:blogger.com,1999:blog-5171098727364395242.post-66424401711191845342014-12-01T15:38:00.000+00:002014-12-03T16:05:19.806+00:00The Escape of ArrayList.iterator() <div class="separator" style="clear: both; text-align: center;">
<a href="http://www.gocomics.com/calvinandhobbes/1986/04/13/" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1SIFO70OTcB5i23XryRCfKhGWvNqtwUZ-P-DhvU7YMJSKli5RrtOYBRPwOjU1VcYKnSCzZZst3tz0IoAzLWH5jCBdEaVm-Fs1cdGxsEYRKYZdyiHAiBXnIiafD35axeOY7r9dKIqRGjM/s1600/Screen+Shot+2014-12-01+at+17.28.29.png" height="203" width="320" /></a></div>
<b>{This post assumes some familiarity with JMH. For more JMH related content start at the new and improved <a href="http://psy-lob-saw.blogspot.com/p/jmh-related-posts.html" target="_blank">JMH Resources Page</a> and branch out from there!}</b><br />
<a href="http://en.wikipedia.org/wiki/Escape_analysis" target="_blank">Escape Analysis</a> was a much celebrated optimisation added to the JVM in Java 6u23:<br />
<blockquote class="tr_bq" style="font-family: Arial, Helvetica, FreeSans, Luxi-sans, 'Nimbus Sans L', sans-serif; font-size: 12px; margin-bottom: 17px; margin-top: 3px;">
"Based on escape analysis, an object's escape state might be one of the following:<br />
<ul style="font-family: Arial, Helvetica, FreeSans, Luxi-sans, 'Nimbus Sans L', sans-serif; font-size: 12px; margin-left: 13px; padding-left: 0px;">
<li style="list-style-image: url(https://docs.oracle.com/javase/webdesign/pubs/im/ul_bullet.gif); margin-left: 13px; padding-left: 0px;"><code style="color: #444444; font-family: 'Courier New', Monaco, Courier, monospace;">GlobalEscape</code> – An object escapes the method and thread. For example, an object stored in a static field, or, stored in a field of an escaped object, or, returned as the result of the current method.</li>
</ul>
<ul style="font-family: Arial, Helvetica, FreeSans, Luxi-sans, 'Nimbus Sans L', sans-serif; font-size: 12px; margin-left: 13px; padding-left: 0px;">
<li style="list-style-image: url(https://docs.oracle.com/javase/webdesign/pubs/im/ul_bullet.gif); margin-left: 13px; padding-left: 0px;"><code style="color: #444444; font-family: 'Courier New', Monaco, Courier, monospace;">ArgEscape</code> – An object passed as an argument or referenced by an argument but does not globally escape during a call. This state is determined by analyzing the bytecode of called method.</li>
</ul>
<ul style="font-family: Arial, Helvetica, FreeSans, Luxi-sans, 'Nimbus Sans L', sans-serif; font-size: 12px; margin-left: 13px; padding-left: 0px;">
<li style="list-style-image: url(https://docs.oracle.com/javase/webdesign/pubs/im/ul_bullet.gif); margin-left: 13px; padding-left: 0px;"><code style="color: #444444; font-family: 'Courier New', Monaco, Courier, monospace;">NoEscape</code> – A scalar replaceable object, meaning its allocation could be removed from generated code.</li>
</ul>
After escape analysis, the server compiler eliminates scalar replaceable object allocations and associated locks from generated code. The server compiler also eliminates locks for all non-globally escaping objects. It does <em>not</em> replace a heap allocation with a stack allocation for non-globally escaping objects." - from <a href="https://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html#escapeAnalysis" target="_blank">Java 7 performance enhancements</a> </blockquote>
Alas, proving an object never escapes is a difficult problem, and many people feel they cannot rely on this optimisation to kick in and "do the right thing" for them. Part of the problem is that there is no easy way to discover if a particular allocation has been eliminated (on a debug OpenJDK build one can use <span style="background-color: white; color: #444444; font-size: 13px; line-height: 17.2800006866455px;"><span style="font-family: Courier New, Courier, monospace;">-XX:+UnlockDiagnosticVMOptions -XX:+PrintEscapeAnalysis -XX:+PrintEliminateAllocations</span></span>).<br />
The EscapeAnalysis skepticism leads some people to go as far as claiming that the JIT compiler fails to eliminate the iterator allocation of collections, which are everywhere:<br />
<script src="https://gist.github.com/nitsanw/b1fca1553f095ff18ab3.js"></script><br />
But why skepticize what you can analyze?<br />
<br />
<h3>
Theory: even simple iterators do not benefit from Escape Analysis</h3>
<div>
So let's give this some thought. I generally think the best of the JVM developer bunch. Sure they may miss here and there, they're just human after all, but we are talking about a good strong bunch of engineers. I tend to assume that if there is a simple, worthwhile case for an optimization then they have not failed to capitalize on it. I would therefore be quite surprised if indeed iterators do not benefit from escape analysis, as they seem quite natural candidates, particularly in the syntactic sugar case, but even in the explicit case. Still, my trust in the JVM engineers is no evidence; how can I prove this works for a given setup?</div>
<div>
<ol>
<li>Use a debug build... I invite the readers to try this method out, didna do it.</li>
<li>Use a memory profiler, like the one packed with VisualVM or YourKit</li>
<li>Set up an experiment to observe the before/after effect of the desired optimization. Use -XX:+/-DoEscapeAnalysis and examine the GC logs to observe the effect.</li>
<li>Look at the assembly...</li>
</ol>
<h3>
Experiment: Observe a loop over an array list</h3>
</div>
<div>
Reach for your tool belts and pick the swiss army knife that is <a href="http://psy-lob-saw.blogspot.com/p/jmh-related-posts.html" target="_blank">JMH</a>. Here's the benchmark I will use for this:</div>
<div>
<script src="https://gist.github.com/nitsanw/c90ce9fd713c12572258.js"></script></div>
<div>
This is one of them times when I'm not particularly concerned with the performance of this bit of code as such, but JMH makes a good crucible for java code. The same effort that went into correct measurement enables the examination of code in an isolated manner. This makes JMH a good tool for testing the JIT compiler.<br />
<br /></div>
<h3>
</h3>
<h3>
Measurement: Profile the experiment</h3>
<div>
I want to plug the experiment into a profiler, so I set the number of iterations to 1000 and get into it man, profiling, doing it you know, like a... like a sex machine, can I count it all? Here's <a href="http://www.yourkit.com/java/profiler/" target="_blank">YourKit</a> reporting:</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi21PQL4VaRoLi7rbl39bdRFr12GjL91VC0cCNUFPF5LvEaXHOk0oyv-pxwj4zATErqrTWzVVYl8NHzDBxh40YvPnRLmUilT23r6C4kcZnEPZlyD_o_gWtH6Lv_RHjE3oV6vBeXwFvV2RY/s1600/Screen+Shot+2014-11-28+at+13.11.53.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi21PQL4VaRoLi7rbl39bdRFr12GjL91VC0cCNUFPF5LvEaXHOk0oyv-pxwj4zATErqrTWzVVYl8NHzDBxh40YvPnRLmUilT23r6C4kcZnEPZlyD_o_gWtH6Lv_RHjE3oV6vBeXwFvV2RY/s1600/Screen+Shot+2014-11-28+at+13.11.53.png" height="83" width="640" /></a></div>
<br /></div>
<div>
Ouch! that array list iterator is right there at the top! all that trust I put in the JVM developers? GONE! Let's get a second opinion from <a href="http://visualvm.java.net/" target="_blank">JVisualVM</a>, just to be sure:</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5HRtBMFayIuYNbQX-yow46U6rLsqeMC704qL2bjyqQbAiTswpiOzsrCUO05OgNS7G0n92nJ5zFqNRT7l_KObagJQmd5orc7N4FujJK8nAPXEiHJNwYyBW3pDcKzxOoCb_6v_omFxh9WM/s1600/Screen+Shot+2014-11-28+at+13.18.38.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5HRtBMFayIuYNbQX-yow46U6rLsqeMC704qL2bjyqQbAiTswpiOzsrCUO05OgNS7G0n92nJ5zFqNRT7l_KObagJQmd5orc7N4FujJK8nAPXEiHJNwYyBW3pDcKzxOoCb_6v_omFxh9WM/s1600/Screen+Shot+2014-11-28+at+13.18.38.png" height="113" width="640" /></a></div>
<br /></div>
<div>
Oh me, oh my, this is bad...<br />
Finally, with tears in our eyes, let us try see what <a href="http://www.oracle.com/technetwork/java/javaseproducts/mission-control/java-mission-control-1998576.html" target="_blank">Java Mission Control</a> has to say:</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkjqFoHH2e81Xc5imQOWwStdpOx2ABScOb_Vnisny9gtrrZZIp49823ICW3xh9MjahEPKMCO_xtltVmDdh4imACOwFuC72elHYkDpb0Iyzk0AmDv-ovdMNG_Rw4d-zvZ_HLUvh9c2vyUw/s1600/Screen+Shot+2014-11-28+at+13.23.39.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkjqFoHH2e81Xc5imQOWwStdpOx2ABScOb_Vnisny9gtrrZZIp49823ICW3xh9MjahEPKMCO_xtltVmDdh4imACOwFuC72elHYkDpb0Iyzk0AmDv-ovdMNG_Rw4d-zvZ_HLUvh9c2vyUw/s1600/Screen+Shot+2014-11-28+at+13.23.39.png" height="114" width="640" /></a></div>
<br /></div>
<div>
How curious! the iterator allocation is gone! but JMC is outnumbered 2 to 1 here. Could it be that some profilers are pooping their pants?</div>
<div>
<br /></div>
<h3>
Measurement: Measure with -XX:+/-DoEscapeAnalysis</h3>
<div>
Sometimes we get lucky and the optimization we want to examine comes with a handy flag to turn it on and off. We expect escape analysis to remove the iterator allocation and thus leave us with a benchmark which generates no garbage. We are also fortunate that JMH comes with the handy GC profiler which simply examines the GC JMX bean to inform us if any collections happened. Running the benchmark with a short list size and a small heap should trigger plenty of young generation GC cycles in each measurement iteration. Lo and behold, with escape analysis on:</div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">$java -jar target/microbenchmarks.jar -f 1 -i 10 -wi 10 -p size=1000 \</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> -jvmArgs="-Xms64m -Xmx64m -XX:+DoEscapeAnalysis" -prof gc ".*.sumIteratorOverList"</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">Benchmark (size) Score Error Units</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sumIteratorOverList 1000 816.367 ± 17.768 ns/op</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sumIteratorOverList:@gc.count.profiled 1000 0.000 ± NaN counts</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sumIteratorOverList:@gc.count.total 1000 0.000 ± NaN counts</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sumIteratorOverList:@gc.time.profiled 1000 0.000 ± NaN ms</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sumIteratorOverList:@gc.time.total 1000 0.000 ± NaN ms</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">$java -jar target/microbenchmarks.jar -f 1 -i 10 -wi 10 -p size=1000 \</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> -jvmArgs="-Xms64m -Xmx64m -XX:-DoEscapeAnalysis" -prof gc ".*.sumIteratorOverList"</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">Benchmark (size) Score Error Units</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sumIteratorOverList 1000 940.567 ± 94.156 ns/op</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sumIteratorOverList:@gc.count.profiled 1000 19.000 ± NaN counts</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sumIteratorOverList:@gc.count.total 1000 42.000 ± NaN counts</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sumIteratorOverList:@gc.time.profiled 1000 9.000 ± NaN ms</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sumIteratorOverList:@gc.time.total 1000 27.000 ± NaN ms</span></div>
<div>
Now that is what we hoped for, escape analysis saves the day! Please note the above numbers are from running on my laptop using Oracle Java 8u20; I was not aiming for measurement accuracy, just wanted to verify the compilation/GC aspects. Laptops are good enough for that.</div>
<div>
<br /></div>
<h3>
WTF Profilers? Why you lie?</h3>
<div>
There's a big difference in how JVisualVM/YourKit work and how JMC works, in particular:</div>
<div>
<ul>
<li>JVisualVM/YourKit: treat the JVM as a black box and profile via the JVMTI interfaces and bytecode injection. Work on all JVMs.</li>
<li>Java Mission Control: uses internal JVM counters/reporting APIs only available to the Oracle JVM (so it can't profile OpenJDK/Zing/IBM-J9 etc.)</li>
</ul>
So why should this get in the way of escape analysis? Searching through the OpenJDK source code we can spot the following:<br />
<blockquote class="tr_bq">
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><span class="s1">void</span> C2Compiler::compile_method(<span class="s2">ciEnv</span>* env, <span class="s2">ciMethod</span>* target, <span class="s1">int</span> entry_bci) {</span></div>
<div class="p2">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><span class="s3"> assert(is_initialized(), </span>"Compiler thread must be initialized"<span class="s3">);</span></span></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> <span class="s1">bool</span> subsume_loads = SubsumeLoads;</span></div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> <span class="s1">bool</span> do_escape_analysis = DoEscapeAnalysis &&!env->jvmti_can_access_local_variables();</span></div>
</blockquote>
This explains a similar issue, but not this one. The above code is relevant for debuggers and other tools relying on stepping through code and examining variable values.<br />
This issue is the result of some instrumentation trapping the reference, at least for JVisualVM. We can look at the assembly to be certain; this is the iterator non-allocation before we connect JVisualVM to the process:
<script src="https://gist.github.com/nitsanw/dabd9c4d6d02a17c7f4f.js"></script><br />
Note how the iterator is initialized with no allocation actually taking place (note also how the line 4 annotation is a big fat lie; this is actually getting <i>modCount</i> from the list). The method is recompiled to the same code when we attach the agent. And here's what happens when we turn on memory profiling:
<script src="https://gist.github.com/nitsanw/02830114e93f1cf492d7.js"></script>
The iterator is now allocated (see the 'new' keyword at line 18 above). We can see that the instrumenting agent added a <i>traceObjAlloc </i>call into the iterator constructor (line 52 above). JVisualVM is open source, so we can dig into the <a href="http://visualvm.sourcearchive.com/documentation/1.1.1/ProfilerRuntimeObjLiveness_8java-source.html" target="_blank">code of the above trace method</a> and the instrumentation code; I leave it as an exercise to the reader. If I was a better man I might have a go at fixing it, maybe another day.<br />
<br />
<h3>
Summary</h3>
<div>
<ul>
<li>EscapeAnalysis works, at least for some trivial cases. It is not as powerful as we'd like it, and code that is not hot enough will not enjoy it, but for hot code it will happen. I'd be happier if the flags for tracking when it happens were not debug only.</li>
<li>Profilers can kill EscapeAnalysis. In the example above we have very little code to look at, so it is easy to cross examine. In the real world you'd be profiling a much larger codebase, looking for allocation elimination opportunities; I suggest you maintain a healthy doubt in your profiler.</li>
<li>Java Mission Control is my current favourite profiler for the Oracle JVM. It's a shame the underlying APIs are not made public and standard so that tool makers can rely on them for all JVMs. Perhaps one day at least part of these APIs will be made public.</li>
</ul>
Thanks <a href="http://vanillajava.blogspot.com/" target="_blank">Peter Lawrey</a> for reviewing this post!<br />
<ul>
</ul>
</div>
</div>
Nitsanhttp://www.blogger.com/profile/10496299147100350513noreply@blogger.com12tag:blogger.com,1999:blog-5171098727364395242.post-1923954146750523222014-11-03T11:16:00.000+00:002016-02-15T07:09:18.030+00:00The Mythical Modulo Mask<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhB-_FWpNiVJLZvZzNTd0S1DLTlExpe_fV6wMXAkVaBoVOg29Ab9Ss_91qP0I8R0IYnmtLH_59fCav-2VmkWxc1v6-lXHDrFUqWI4Lt7ZH9114n0GjP7c2RImKYCdNwSBGsK1NzNFfAFog/s1600/jabberwocky.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhB-_FWpNiVJLZvZzNTd0S1DLTlExpe_fV6wMXAkVaBoVOg29Ab9Ss_91qP0I8R0IYnmtLH_59fCav-2VmkWxc1v6-lXHDrFUqWI4Lt7ZH9114n0GjP7c2RImKYCdNwSBGsK1NzNFfAFog/s1600/jabberwocky.jpg" width="134" /></a>It is an optimisation well known to those who know it well that % of power of 2 numbers can be replaced by a much cheaper AND operator to the same power of 2 - 1. E.g:<br />
x % 8 == x & (8 - 1)<br />
<span style="background-color: #f4cccc; font-size: x-small;">[4/11/2014 NOTE] This works because the binary representation for N which is a power of 2 will have a single bit set to 1 and (N-1) will have all the bits below that set to 1 (e.g 8 = 00001000, 8-1= 00000111). When we do x AND (N-1) only the remainder of x / N will match the N-1 mask.</span><br />
<span style="background-color: #f4cccc; font-size: x-small;"><u>[4/11/2014 NOTE + slight spoiler: this only works when x >= 0]</u></span><br />
<span style="background-color: #f4cccc; font-size: x-small;"><u>[15/02/2016 NOTE: It has been pointed out that % is not <b>modulo</b>, but <b>remainder</b>. Why everyone calls it modulo I'm not sure]</u></span><br />
The reason the & is so much cheaper is that % is implemented using the DIV instruction while & is just an AND, and as it turns out DIV is expensive and AND is cheap on x86 CPUs (and other places too, I think). The optimisation is used in the Disruptor as well as the JCTools circular array queues, and in ArrayDeque and other JDK classes. Is it time to replace % with & everywhere in your code which has this opportunity?<br />
<span style="background-color: #f4cccc;"><span style="font-size: x-small;">[4/11/2014 NOTE] Technical term for this sort of optimization is </span><a href="http://en.wikipedia.org/wiki/Strength_reduction" style="font-size: small;" target="_blank">Strength Reduction</a><span style="font-size: x-small;">. </span></span><br />
<br />
<h3>
<span class="s1"><b>Starting Simple</b></span></h3>
Lets start with some basic benchmarks:<br />
<script src="https://gist.github.com/nitsanw/c88ffaa24f5235bc6455.js"></script>
And the results (on JDK8u5/E5-2697 v2 @ 2.70GHz/-XX:-UseCompressedOops for consistency between assembly and results):<br />
<span style="font-family: "courier new" , "courier" , monospace;"> <b>Benchmark Score error Units</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> moduloLengthNoMask 3.448 ± 0.007 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> moduloConstantLengthNoMask 1.150 ± 0.002 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> moduloLengthMask 1.030 ± 0.006 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> moduloMask 0.862 ± 0.001 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> consume 0.719</span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;">±</span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;">0.001 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> noop 0.287</span><span style="font-family: "courier new" , "courier" , monospace;"> ± </span><span style="font-family: "courier new" , "courier" , monospace;">0.000 ns/op</span><br />
<br />
So pretty much as per expectation the modulo operation is far more expensive than the mask:<br />
<ul>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRT9SLxMoUiHE1g8yv0uVJJeVrYl7t0dgc1Iuv9tLwkqQlRsYf9l6SJsfm8xMV69ZyU6O2SPFyUwJavTLMy7Lft72H6dR7lWfRyXqmX5lQhpLCJZSBNbs-x1fO_JNjg7a07YbrlepsdaM/s1600/George_Peppard.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRT9SLxMoUiHE1g8yv0uVJJeVrYl7t0dgc1Iuv9tLwkqQlRsYf9l6SJsfm8xMV69ZyU6O2SPFyUwJavTLMy7Lft72H6dR7lWfRyXqmX5lQhpLCJZSBNbs-x1fO_JNjg7a07YbrlepsdaM/s1600/George_Peppard.jpg" width="320" /></a>
<li>The clever JIT is aware of the optimisation opportunity and will replace a constant % with the &. It is not a perfect replacement, but pretty close.</li>
<li>At this sort of low digit ns benchmark we can’t make a statement such as “modulo is 4 times more expensive” because the same machine produces a baseline of <span style="font-family: "courier new" , "courier" , monospace;">0.287ns/op</span> for the noop benchmark and <span style="font-family: "courier new" , "courier" , monospace;">0.719ns/op</span> for the consume benchmark. If we deduct the consume result from the other scores we see a 1 : 25 ratio between the costs. Is that a good way to model performance? not really either, performance is not additive so simply subtracting one cost from the other doesn't really work at this scale. The truth is somewhere fuzzy in between and if we really care we should look at the assembly.</li>
<li>It seems that using a pre-calculated mask field is more awesome than using the "array length - 1" as a mask. That is consistent with the expectation that the re-calculation of the mask on the fly, as well as loading the value to be used for that calculation, is more expensive than using the pre-calculated field.</li>
</ul>
<div>
<a href="http://en.wikipedia.org/wiki/John_%22Hannibal%22_Smith" target="_blank">I love it when a plan comes together...</a><br />
<br />
<h3>
Going Deeper</h3>
The reason we wanted the modulo in the first place was to read from the array, right? so let’s try that:
<script src="https://gist.github.com/nitsanw/1e7686741724f48e261f.js"></script>
<br />
<div class="p1">
<span class="s1">And the results:</span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"> <b>Benchmark Score error Units</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> readByLengthNoMask 3.736 ± 0.005 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> readByConstantLengthNoMask 1.437 ± 0.001 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> readByMask 1.347 ± 0.022 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> readByLengthMask 1.181 ± 0.049 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> readNoMask 1.175 ± 0.004 ns/op</span><br />
Well, what’s this I see? "length-1" mask is leading the chart! How’d that happen?</div>
<div class="p1">
<span class="s1">To quote from the famous “<a href="http://www.amazon.com/Jack-Flumflum-Tree-Julia-Donaldson/dp/0330504061" target="_blank">Jack and the FlumFlum Tree</a>”:</span></div>
<div class="p1">
<blockquote class="tr_bq">
<span class="s1">“Don’t get your knickers in a twist!” said Jack,<br />“Let’s have a look in the patchwork sack.”</span></blockquote>
</div>
<div class="p1">
<span class="s1">Lets start with the generated assembly for the constant modulo:</span></div>
<div class="p1">
<span class="s1"><script src="https://gist.github.com/nitsanw/aa1775cb5d2b20d9df0d.js"></script></span></div>
I didna see that one coming! The modulo by a constant is not your garden-variety & mask affair, since it turns out our original assertion about the mask/modulo equivalence is only true for non-negative numbers. The JIT, in its wisdom, deals with the negative case by doing (x = -x; x = x & 15; x = -x;).<br />
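<div class="p1">
<span class="s1">To see why the extra work is needed, here's a tiny standalone demo of my own (not from the post) showing how % and & diverge for negative operands, and how the negate-mask-negate dance recovers Java's % semantics:</span></div>
<pre style="font-family: 'Courier New', Courier, monospace;">
public class ModVsMask {
  public static void main(String[] args) {
    int x = -5;
    System.out.println(x % 16);   // -5 : Java's % keeps the sign of the dividend
    System.out.println(x & 15);   // 11 : the mask treats the bits as unsigned
    // What the JIT-generated code effectively does on the negative branch:
    int t = -x;                   // 5
    t = t & 15;                   // 5
    System.out.println(-t);       // -5, matching x % 16
  }
}
</pre>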
I think the JIT's generated version above could be made a tiny bit faster by switching the branch around (i.e. jumping on the negative value). It’s easy, however, to see what happens if we simplify the constant version further by using a constant mask:
<script src="https://gist.github.com/nitsanw/18a81ebf3cf336d5eba3.js"></script><br />
And results:<br />
<span style="font-family: "courier new" , "courier" , monospace;"> <b>Benchmark Score error Units</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> moduloConstantLengthNoMask 1.150 ± 0.002 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <u>moduloConstantLengthMask 0.860 ± 0.001 ns/op</u></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> readByConstantLengthNoMask 1.437 ± 0.001 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <u>readByConstantLengthMask 1.209 ± 0.017 ns/op</u></span><br />
So minor joy on the modulo, and the constant-mask read is better than the plain mask read, nearly as good as the "length-1" mask read. Oh well, let's move on.<br />
The big surprise was the version that calculates the mask on the fly from the array length. How can calculating the mask on the fly, which seemed to be slower, end up being faster when reading from the array? Who feels like more assembly?
<script src="https://gist.github.com/nitsanw/3d25510ba6aeb72d02dc.js"></script><br />
I was hoping the JVM was clever enough to remove the array bound checks, but that didn’t happen. What’s happening here is that the length load serves the purpose of both creating the mask and checking the bounds. This is not the case for the mask version where we load the mask for the index calculation and the length for the bounds check, thus paying for 2 loads instead of one:
<script src="https://gist.github.com/nitsanw/8191aeea81b94fe39730.js"></script><br />
So removing the computation did not make a difference, because the bounds check requires the extra load of the length anyhow. Can we make the bounds check go away? Of course we can, but it’s Unsafe!!! Let’s do it anyways!
<script src="https://gist.github.com/nitsanw/04affeafca439526a400.js"></script><br />
<div>
The assembly:</div>
<script src="https://gist.github.com/nitsanw/58f201632eb943bbb3ec.js"></script>
<br />
<div>
Shazzam! No bounds check, but look at all the work that’s gone into the unsafe read of the array. It would have been so much better if the unsafe read enjoyed the same addressing mode as normal array reads, like so: “<span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">r8d,DWORD PTR [r9+r10*4+0x18]</span>”, but it seems the JIT compiler is not recognising the opportunity here. What’s the performance like?<br />
<span style="font-family: "courier new" , "courier" , monospace;"> <b>Benchmark Score error Units</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> readByMask 1.347 ± 0.022 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> readByLengthMask 1.181 ± 0.049 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> readNoMask 1.175 ± 0.004 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <u>unsafeReadByMask 1.152 ± 0.001 ns/op</u></span>
</div>
<div>
<br />
This is even better than no mask at all. Yay?</div>
<div>
Well… sort of. If you mean to have the fastest ‘get’ from an array whose size is not an application constant, then this is a mini-win. In particular it saves you a load of the array length in this case, and loads can cost anything really. In the case where the index and mask are long we can get better code generated:</div>
</div>
<div>
<script src="https://gist.github.com/nitsanw/7333153e9d867ffa5719.js"></script></div>
<div>
<div>
But performance is much the same. Seems like there’s not much left to win for this case.</div>
<div>
For completeness' sake we can compare the no-mask result with an Unsafe equivalent:</div>
<span style="font-family: "courier new" , "courier" , monospace;"> <b>Benchmark Score error Units</b></span><br />
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace;"> </span></span><u><span style="font-family: "courier new" , "courier" , monospace;">unsafeReadByNoMask 1.038 ± 0.022</span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> ns/op</span></u></div>
<div>
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace;"> </span></span><span style="font-family: "courier new" , "courier" , monospace;">readNoMask 1.175 ± 0.004 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div class="p1">
<span class="s1"><span style="font-family: inherit;">So it seems slipping past the array boundary check is worth something, but is it generally worth it? what if we weren't dealing with just the one element?</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: inherit;"><br /></span></span></div>
<div class="p1">
<h3>
<span class="s1"><span style="font-family: inherit;">Bound Check Elimination</span></span></h3>
</div>
<div class="p1">
<span class="s1">Looking at the above optimisation we need to accept that it is probably only worth it if array bound checks happen on every access. If we now compare a sum over an array:</span></div>
<script src="https://gist.github.com/nitsanw/700185415420c034437e.js"></script>
<br />
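<div class="p1">
<span class="s1">The shape of the three compared loops is roughly as below (my sketch, reusing the hypothetical UNSAFE/INT_BASE/INT_SCALE helpers from the earlier snippet; the real code is in the gist above):</span></div>
<pre style="font-family: 'Courier New', Courier, monospace;">
int loopOverArrayStraight(int[] data) {
  int sum = 0;
  for (int i = 0; i < data.length; i++)
    sum += data[i];                                          // plain access, bounds check eliminated in the loop
  return sum;
}

int loopOverArrayUnsafeInt(int[] data) {
  int sum = 0;
  for (int i = 0; i < data.length; i++)
    sum += UNSAFE.getInt(data, INT_BASE + i * INT_SCALE);    // widening multiply per element
  return sum;
}

int loopOverArrayUnsafeLong(int[] data) {
  int sum = 0;
  for (long i = 0; i < data.length; i++)                     // long induction variable
    sum += UNSAFE.getInt(data, INT_BASE + i * INT_SCALE);    // ... plus a safepoint poll in the loop
  return sum;
}
</pre>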
<div class="p1">
<span class="s1"><span style="font-family: inherit;">We get the following results (length=100):</span></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"> <b>Benchmark Score error Units</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> loopOverArrayStraight 26.855</span><span style="font-family: "courier new" , "courier" , monospace;"> ± </span><span style="font-family: "courier new" , "courier" , monospace;">0.060 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> loopOverArrayUnsafeInt 41.413</span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;">±</span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;">0.056 ns/op</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> loopOverArrayUnsafeLong 76.257</span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;">±</span><span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;">0.171 ns/op</span><br />
<span style="font-family: inherit;">Oh Unsafe, why you so sucky sucky? </span>How come the unsafe versions suck so significantly? isn’t Unsafe the cure to all performance problems?<br />
Once the bounds check is eliminated by the JIT we can see that the UnsafeInt version has the same addressing-conversion issue, only now the cost is not compensated for by the bounds check removal. The UnsafeLong version is even worse. How come?<br />
The generated loop for the int case is long and boring because it’s unrolled; the long case is pretty small:
<script src="https://gist.github.com/nitsanw/d89d938e68b4b06c0d41.js"></script><br />
Two 'bad' things just happened:
<br />
<ol>
<li>Addressing didn’t work out the way we’d like. Instead of the desired “mov r11d,DWORD PTR [r9+rdi*4+0x18]” we get a two-stage setup where we do “lea r10,[r9+rdi*4]” and then “add r11d,DWORD PTR [r10+0x18]”. Bummer.</li>
<li>We got a safepoint poll in the loop. This happens because long-indexed loops are considered potentially very long (as opposed to shorter int-indexed counted loops; it's a heuristic aimed at keeping time-to-safepoint short), and so they include a safepoint poll.</li>
</ol>
So we want to fix the addressing mode and stick to an int index. If we were to insist on using Unsafe (perhaps because we are trying to do this with off-heap memory instead of an array) we’d have to do this:
<script src="https://gist.github.com/nitsanw/48a2a2e8d542b31ada63.js"></script><br />
<span style="background-color: #f4cccc; font-size: x-small;">[4/11/2014 NOTE] Note that what we really want here is more than just getting rid of the multiplication/widening, we want the JIT to identify the expression calculated for offset as relative array access and pick the correct addressing mode for MOV to use. There are clever people out there trying to make sure this will work better in the future.</span><br />
This removes the need for a safe point poll and simplifies addressing to the point where we nearly match the iteration over the array case (length=100):<br />
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> <b>Benchmark Score error Units</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> loopOverArrayStraight 26.855</span><span style="font-family: "courier new" , "courier" , monospace;"> ± </span><span style="font-family: "courier new" , "courier" , monospace;">0.060 ns/op</span></div>
<div>
<div style="font-family: 'Courier New', Courier, monospace;">
loopOverArrayUnsafePointer 27.377 ± 0.049 ns/op</div>
<div>
<div class="p1">
<span class="s1"><span style="font-family: inherit;">We can explore the relationship between the implementations by testing for different array sizes:</span></span></div>
<div class="p1" style="font-family: 'Courier New', Courier, monospace;">
<span class="s1"><b> 10 100 1000 10000</b></span></div>
<div class="p1" style="font-family: 'Courier New', Courier, monospace;">
<span class="s1"><b>straight</b> 4.3 26.8 289.2 2883.7</span></div>
<div class="p1" style="font-family: 'Courier New', Courier, monospace;">
<span class="s1"><b>unsafeP</b> 4.8 27.3 296.1 2886.4</span></div>
</div>
</div>
<br />
<div class="p1">
<span class="s1">So it seems that the smaller the array the more relative advantage the array iteration has when iterating in this fashion. This should not really be surprising, there's nothing here to confuse the JIT compiler and iterating over arrays is important enough to optimize. We have to work hard to get close to the JIT compiler when it does what it does best.</span><br />
<span class="s1"><br /></span>
<br />
<h3>
Summary</h3>
We had a simple optimisation in mind, replace a % with &:<br />
<ul>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghVW3Fs49B8GQ0op8JmbDu2VEP_QA0U532KhEvVC5IOgw7y6Ao0aEC6fskvrEJwDV_1riD7FSMm0oxwk6kFTEGkdOD0Zv5I3eRokwXEMqbDfi2ctavotZDqGd003TdKoEfclDY7xq0PJg/s1600/o_rlmente.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="199" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghVW3Fs49B8GQ0op8JmbDu2VEP_QA0U532KhEvVC5IOgw7y6Ao0aEC6fskvrEJwDV_1riD7FSMm0oxwk6kFTEGkdOD0Zv5I3eRokwXEMqbDfi2ctavotZDqGd003TdKoEfclDY7xq0PJg/s1600/o_rlmente.jpg" width="200" /></a>
<li>Observed that, where constants are used, the JIT is able to perform that optimisation for us almost as well as we’d do it ourselves (we have no way of specifying a positive-only modulo, i.e. an unsigned int).</li>
<li>We proved the viability of the optimisation in two variations: using a pre-calculated mask field and using (array.length - 1).</li>
<li>Using the optimisation in the context of a circular array read showed an interesting reversal in performance. The cause of this reversal is that the array.length load done for the bounds check is reused when the mask is calculated on the fly, whereas the pre-calculated mask version pays for an extra load.</li>
<li>Using Unsafe we managed to bypass the array bound check and get the best result using the mask for a single read. </li>
<li>When we try the same method naively in a loop (over the whole array), the array bounds check is eliminated and plain old array access is the best performer.</li>
<li>To regain the performance for Unsafe access we have to tweak the code to avoid the safepoint poll, as well as to get the addressing mode we want in the resulting assembly. Even then, plain array access is better for smaller arrays.</li>
</ul>
</div>
<span style="font-family: inherit;">Simple innit?</span><br />
<span style="font-family: inherit;">Some notes on methodology:</span><br />
<ul>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAU5BKJtQkWS-D45CfjxcS4sgzZdVyKSZ_KVxLNtm3g7cLQNqLwOlyBghcPu2kGHHEVka-TWu5Z81QHf24zwqzC9ozHJrC8PUwPXmDRR1aXlhH73ajaVjQdh1lZ0fTL10e4WNDwTHS1iM/s1600/the+elephant.jpeg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAU5BKJtQkWS-D45CfjxcS4sgzZdVyKSZ_KVxLNtm3g7cLQNqLwOlyBghcPu2kGHHEVka-TWu5Z81QHf24zwqzC9ozHJrC8PUwPXmDRR1aXlhH73ajaVjQdh1lZ0fTL10e4WNDwTHS1iM/s1600/the+elephant.jpeg" /></a>
<li>I ran the same experiments on different Intel generations; you get different results but the assembly remains the same. E.g. on older CPUs the maximum number of instructions per cycle is lower than on the Ivy Bridge CPU I've used here, which leads to instructions spilling over to the next cycle. The L1 latency could be higher, leading to loads dominating the costs, etc. This ends up giving a slightly different balance between compute and memory loads. The overall analysis holds.</li>
<li>Using -XX:-UseCompressedOops was done for the sake of consistent assembly and results. Using compressed oops makes loads look a bit clumsier and I wanted to have less to explain. But running with the flag on (as it is by default) also affects results at this scale. In particular, because compressed oops require a shift to be used and shifters are a limited resource on the CPU (1 on Westmere, 2 on Ivy Bridge), it can end up adding a cycle to the results.</li>
<li>Running these same experiments on a laptop was good for getting the assembly out and a vague sense of scale for results, but measurements had far greater error in that environment. Also note that laptops and desktops tend to be a generation ahead of servers where processors are concerned.</li>
<li>An interesting follow-up would be to look at the same experiments with the JMH perfasm profiler. I did that, but could not figure out how to get Intel syntax out of it, and so for consistency's sake stuck with what I had. Left as an exercise to the reader :P</li>
</ul>
Many thanks to <a href="http://jpbempel.blogspot.com/" target="_blank">J.P. Bempel</a> and <a href="https://twitter.com/peterhughesdev" target="_blank">Peter Hughes</a> for reviewing, any issues remaining were added by me after they reviewed the post.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br /></div>
</div>