Psychosomatic, Lobotomy, Saw: April 2013

Sunday, 28 April 2013

Writing Java Micro Benchmarks with JMH: Juicy

{UPDATE 03/09/14: If you come here looking for JMH related content start at the new and improved JMH Resources Page and branch out from there!}
Demonstrating use of JMH and exploring how the framework can squeeze every last drop out of a simple benchmark.
Writing micro benchmarks for Java code has always been a rather tricky affair with many pitfalls to look out for:

JIT:

Pre/Post compilation behaviour: After 10K (default, tuneable via -XX:CompileThreshold) invocations your code will morph into compiled assembly, making it hopefully faster, and certainly different from its interpreted version.
Specialisation: The JIT will optimise away code that does nothing (i.e. has no side effects) and will optimise for single interface/class implementations. A smaller code base, like a micro-benchmark, is a prime candidate.
Loop unrolling and OSR can make benchmark code (typically a loop of calls to the profiled method) perform differently from how it would in real life.
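To make the "code that does nothing" pitfall concrete, here is a minimal sketch of a hand-rolled benchmark loop. The class and method names are hypothetical and work() is just a stand-in for the code under test. The first loop throws the result away, so the JIT is free to remove the call; the second feeds every result into a value that is returned. Even then the JIT may simplify the loop, so treat this as an illustration of the problem rather than a fix.

```java
// Hypothetical hand-rolled benchmark loops; names are illustrative only.
final class NaiveLoopSketch {
    // Stand-in for the method being measured.
    static int work(int x) {
        return x * 31 + 7;
    }

    // The result is discarded: with no side effects the JIT may drop the call.
    static void discardingLoop(int iterations) {
        for (int i = 0; i < iterations; i++) {
            work(i);
        }
    }

    // The result flows into a returned value, so the work stays observable.
    static int consumingLoop(int iterations) {
        int sink = 0;
        for (int i = 0; i < iterations; i++) {
            sink += work(i);
        }
        return sink;
    }
}
```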

GC effects:

Escape analysis may succeed in a benchmark where it would fail in real code.
A buildup to a GC might be ignored in a run or a collection may be included.

Application/Threading warmup: during initialisation, threading behaviour and resource allocation can lead to significantly different behaviour from the steady state.

Environmental variance:

Hardware: CPU/memory/NIC etc...
OS
JVM: which one? running with which flags?
Other applications sharing resources
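One cheap habit that helps with environmental variance is recording the environment next to every result. The sketch below uses only standard JDK calls; the class name and the choice of keys are my own assumptions, not part of any framework.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: captures the environment a benchmark ran in.
final class EnvironmentSnapshot {
    static Map<String, String> capture() {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        Map<String, String> env = new LinkedHashMap<>();
        env.put("os", System.getProperty("os.name") + " " + System.getProperty("os.arch"));
        env.put("jvm", runtime.getVmName() + " " + runtime.getVmVersion());
        // The JVM flags this process was started with, e.g. -XX:CompileThreshold=...
        env.put("flags", String.valueOf(runtime.getInputArguments()));
        env.put("cpus", Integer.toString(Runtime.getRuntime().availableProcessors()));
        env.put("maxHeapBytes", Long.toString(Runtime.getRuntime().maxMemory()));
        return env;
    }

    public static void main(String[] args) {
        capture().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

It will not tell you what else is running on the box, but it does answer "which JVM, with which flags?" when you look at old numbers.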

Here's a bunch of articles on Java micro-benchmarks which discuss the issue further:

Robust Java Benchmarking
The perils of benchmarking under dynamic compilation
How to write a benchmark?(from StackOverflow)
Read the footnotes of the JMH samples for further highlights on the topic.

Some of these issues are hard to solve, but some are addressable via a framework and indeed many frameworks have been written to tackle the above.

Let's talk about JMH

JMH (Java Micro-benchmarks Harness or Juicy Munchy Hummus, hard to say as they don't tell you on the site) is the latest, and as it comes out of the workshop of the very people who work hard to make the OpenJDK JVM fly, it promises to deliver more accuracy and better tooling than most.
The source/project is here and you will currently need to build it locally to have it in your maven repository, as per the instructions. Once you've done that you are good to go, and can set yourself up with a maven dependency on it.
Here is the project I'll be using throughout this post, feel free to C&P to your heart's content. It's a copy of the JMH samples project with the JMH jar built and maven sorted (see update below on current state of samples) and all that so you can just clone and run without setting up JMH locally. The original samples are pure gold in terms of highlighting the complexities of benchmarking, READ THEM! The command line output is detailed and informative, so have a look to see what hides in the tool box.
I added my sample on top of the original samples. It is basic (very very basic) in its use of the framework, but the intention here is to help you get started, not drown you in detail, and give you a feel of how much you can get out of it for very little effort. Here goes...

It's fun to have fun, but you've got to know how

For the sake of easy comparison and reference I'll use JMH to benchmark the same bit of code I benchmarked with a hand rolled framework here and later on with Caliper here. We're benchmarking my novel way of encoding UTF-8 strings into ByteBuffers vs String.getBytes() vs best practice recommendation of using a CharsetEncoder. The benchmark compares the 3 methods by encoding a test set of UTF-8 strings samples.
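For readers who have not seen the earlier posts, here is a rough sketch of the two standard approaches being compared. The custom encoder is not reproduced here, and the class and method names are hypothetical rather than taken from the benchmark project.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Illustrative only: the two JDK-provided ways of getting UTF-8 into a ByteBuffer.
final class Utf8EncodingSketch {
    // Approach 1: String.getBytes() allocates a fresh byte[] which is then copied.
    static void viaGetBytes(String s, ByteBuffer dst) {
        dst.put(s.getBytes(StandardCharsets.UTF_8));
    }

    // Approach 2: a reusable CharsetEncoder writing straight into the buffer.
    static void viaCharsetEncoder(CharsetEncoder encoder, String s, ByteBuffer dst) {
        encoder.reset();
        CoderResult result = encoder.encode(CharBuffer.wrap(s), dst, true);
        if (result.isError() || result.isOverflow()) {
            throw new IllegalStateException("encode failed: " + result);
        }
        if (encoder.flush(dst).isOverflow()) {
            throw new IllegalStateException("destination buffer too small");
        }
    }
}
```

The encoder would typically be created once with StandardCharsets.UTF_8.newEncoder() and reused across calls, which is where it saves on allocation compared to getBytes().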
Here's what the benchmark looks like when using JMH:

We're using three JMH annotations here:

State - This annotation tells JMH how to share benchmark object state. I'm using the Thread scope, which means no sharing is desirable. There are 2 other scopes available: Group (for sharing the state between a group of threads) and Benchmark (for sharing the state across all benchmark threads).
Setup - Much like the JUnit counterpart the Setup annotation tells JMH this method needs to be called before it starts hammering my methods. Setup methods are executed appropriately for your chosen State scope.
GenerateMicroBenchmark - Tells JMH to fry this method with onions.

A lot of good tricks, I will show them to you, your mother will not mind at all if I do

To get our benchmarks going we need to run the generated microbenchmarks.jar. This is what we get:

Nice innit?
Here's the extra knobs we get on our experiment for our effort:

I'm using some command line options to control the number of iterations/warmup iterations, here's the available knobs on that topic:

i - number of benchmarked iterations, use 10 or more to get a good idea
r - how long to run each benchmark iteration
wi - number of warmup iterations
w - how long to run each warmup iteration (give ample room for warmup, how much will depend on the code you try and measure, try and have it execute 100K times or so)

To choose which benchmarks we want to run we need to supply a regular expression to filter them or ".*" to run all of them. If you can't remember what you packed use:

v - verbose run will also print out the list of available benchmarks and which ones were selected by your expression
l - to list the available benchmarks

If you wish to isolate GC effects between iterations you can use the gc option, this is often desirable to help getting more uniform results.
Benchmarks are forked into separate VMs by default. If you wish to run them together add "-f 0" (you shouldn't really do this unless you are trying to debug something... forking is good). The framework also allows you to run several forks for each benchmark to help identify run to run variance.

The output is given for every iteration, then a summary of the stats. As I'm running 3 iterations these are not very informative (this is not recommended practice and was done for the sake of getting sample outputs rather than accurate measurement, I recommend you run more than 10 iterations and compare several JMH runs for good measure) but if I was to run 50 iterations they'd give me more valuable data. We can choose from several output formats to generate graphs/reports later. To get CSV format output add "-of csv" to your command line, which leaves you to draw your own conclusions from the data (no summary stats here):

The above has your basic requirements from a benchmark framework covered:

Make it easy to write benchmarks
Integrate with my build tool
Make it easy to run benchmarks (there's IDE integration on the cards to make it even easier)
Give me output in a format I can work with

I'm particularly happy with the runnable jar as a means to packaging the benchmarks as I can now take the same jar and try it out on different environments which is important to my work process. My only grumble is the lack of support for parametrization which leads me to use a system property to switch between the direct and heap buffer output tests. I'm assured this is also in the cards.

I will show you another good game that I know

There's even more! Whenever I run any type of experiment the first question is how to explain the results and what differences one implementation has over the other. For small bits of code the answer will usually be 'read the code you lazy bugger' but when comparing 3rd party libraries or when putting large compound bits of functionality to the test profiling is often the answer, which is why JMH comes with a set of profilers:

gc: GC profiling via standard MBeans
comp: JIT compiler profiling via standard MBeans
cl: Classloader profiling via standard MBeans
hs_rt: HotSpot (tm) runtime profiling via implementation-specific MBeans
hs_cl: HotSpot (tm) classloader profiling via implementation-specific MBeans
hs_comp: HotSpot (tm) JIT compiler profiling via implementation-specific MBeans
hs_gc: HotSpot (tm) memory manager (GC) profiling via implementation-specific MBeans
hs_thr: HotSpot (tm) threading subsystem via implementation-specific MBeans
stack: Simple and naive Java stack profiler
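To give a feel for what "via standard MBeans" means, here is a small sketch that reads GC and JIT totals from the java.lang.management beans. This is not how the JMH profilers are implemented, just an assumption-laden illustration of the data source they name; the class and method names are made up.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Illustrative only: reads the standard platform MBeans directly.
final class MBeanSnapshotSketch {
    // Sum of collections across all collectors; undefined counts (-1) are skipped.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Accumulated GC time in milliseconds; undefined times (-1) are skipped.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime();
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    // Accumulated JIT compilation time in milliseconds, or -1 if unavailable.
    static long jitTimeMillis() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) {
            return -1;
        }
        return jit.getTotalCompilationTime();
    }
}
```

Taking a snapshot before and after an iteration and subtracting gives a per-iteration figure.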

Covering the lot exceeds the scope of this blog post, so let's focus on the obvious ones that might prove helpful for this experiment. Running with the gc and hs_gc profilers (note: this should be done with a fixed heap for best results, I'm just demonstrating output here) gives this output:

The above supports the theory that getBytes() is slower because it generates more garbage than the alternatives, and highlights the low garbage impact of custom/charset encoder. Running with the stack and hs_rt profilers gives us the following output:
What I can read from it is that getBytes() spends less time in encoding than the other 2 due to the overheads involved in getting to the encoding phase. Custom encoder spends the most time on encoding, but what is significant is that, as it outperforms charset encoder and the ratios are similar, we can deduce that the encoding algorithm itself is faster.

But that is not all, oh no, that is not all

The free functionality does not stop here! To quote Tom Waits:

It gets rid of your gambling debts, it quits smoking
It's a friend, and it's a companion,
And it's the only product you will ever need
Follow these easy assembly instructions it never needs ironing
Well it takes weights off hips, bust, thighs, chin, midriff,
Gives you dandruff, and it finds you a job, it is a job
...
'Cause it's effective, it's defective, it creates household odors,
It disinfects, it sanitizes for your protection
It gives you an erection, it wins the election
Why put up with painful corns any longer?
It's a redeemable coupon, no obligation, no salesman will visit your home

'What more?' you ask, well... there's loads more functionality around multi threading I will not attempt to try in this post and several more annotations to play with. In a further post I'd like to go back and compare this awesome new tool with the previous 2 variations of this benchmark and see if, how and why results differ...
Many thanks to the great dudes who built this framework of whom I'm only familiar with Master Shipilev (who also took time to review, thanks again), they had me trial it a few months back and I've been struggling to shut up about it ever since :-)

UPDATE (1/08/2013): If you are looking for more JMH related info, see Shipilev's slides on benchmarking.
UPDATE (1/07/2014): The samples repository has been updated to reflect JMH progress, sample code may have minor differences from the code presented above and command line options may differ. I will follow up at some point with a re-vamp of JMH related posts but for now you can always reproduce above results by reviving the older code from history.

Sunday, 7 April 2013

135 Million messages a second between processes in pure Java

{This post is part of a long running series on lock free queues, check out the full index to get more context here}
Porting an existing single producer/single consumer concurrent queue into an IPC mechanism via memory mapped files and getting 135 million messages throughput in pure Java.
In my previous post I covered a single producer/consumer queue developed and shared by Martin Thompson capable of delivering an amazing 130M messages per second. The queue he delivered is a great tool for communicating between threads, but sometimes communicating between threads is not enough. Sometimes you need to leave your JVM and go out of process. Inter Process Communication (IPC) is a different problem to inter thread communication, can it be cracked by the same approach?

IPC, what's the problem?

Inter Process Communication is an old problem and there are many ways to solve it (which I will not discuss here). There are several attractions to specialized IPC solutions for Java:

Faster than socket communication.
An out of process integration option with applications written in other languages.
A means of splitting large VMs to smaller ones improving performance by allowing GC and JIT specialization.

For IPC to be attractive it has to be fast, otherwise you may as well go for network based solutions which would extend beyond your local machine uniformly. I attended an Informatica conference a while back and got talking to Todd Montgomery about the Disruptor and mechanical sympathy, and he suggested that IPC should be able to perform as well as inter thread messaging. I found the idea interesting and originally meant to port the Disruptor, but Martin's queue is simpler (and quicker) so I went for that instead. Starting with a good algorithm/data structure is very good indeed, now I just needed to bridge the gap and see if I could maintain the benefits.

Off the heap we go!

To do IPC we must go off heap. This has several implications for the queue, most importantly references are not supported. Also note persistence to and from the queue is required, though one could extend my implementation to support a zero copy interaction where a struct is acquired, written and committed instead of the offer method, and similarly acquired, read and finally released instead of the poll method. I plan to make several flavours of this queue to test out these ideas in the near future.

My IPC queue uses a memory mapped file as a means of acquiring a chunk of shared memory, there is no intention to use the persisted values though further development in that direction may prove interesting to some. So now that I got me some shared memory, I had to put the queue in it.

I started by laying out the queue counters and cached counters. After realizing the counters need to be aligned to work properly I learnt how to align memory in Java. I went on to verify that aligned memory offers the guarantees required for concurrent access. Quick summary:

aligned access means reading/writing data types at addresses which are a multiple of their size (e.g. an 8 byte long at an address divisible by 8).
unaligned access is not atomic, which is bad for concurrency :(
unaligned access is slow, which is bad for performance :(
unaligned access may not work, depending on OS and architecture. Not working is very bad :(
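The arithmetic behind alignment is short enough to show. This is a minimal sketch with hypothetical names, assuming a 64 byte cache line and power of 2 sizes:

```java
// Illustrative only: alignment arithmetic for power-of-2 sizes.
final class AlignmentSketch {
    static final int CACHE_LINE = 64; // assumed cache line size in bytes

    // Rounds address up to the next multiple of alignment (a power of 2).
    static long alignUp(long address, int alignment) {
        return (address + alignment - 1) & ~(alignment - 1L);
    }

    // True if address is a multiple of size (a power of 2).
    static boolean isAligned(long address, int size) {
        return (address & (size - 1)) == 0;
    }
}
```

For example, alignUp(65, CACHE_LINE) gives 128, the start of the next cache line, and a long written at address 24 is aligned while one written at 26 is not.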

Sorting out alignment is not such a big deal once you know how it works. One of the nice things about going off-heap was that solving false sharing has become far more straightforward. Move your pointer and you're in the next cache line, job done. This left me rather frustrated with the tricks required to control memory layout in Java. Going back to the original implementation you will notice the Padded classes whose role is to offer false sharing protection. They are glorious hacks (with all due respect) made necessary by this lack of control. The @Contended annotation coming in JDK 8 will hopefully remove the need for this.
This is how the memory layout worked out:

To illustrate in glorious ASCII graphics (each - is a byte), this is what the memory layout looks like when broken into cache lines:
|--------|--------|--------|head----|--------|--------|--------|--------|
|--------|--------|--------|tailCach|--------|--------|--------|--------|
|--------|--------|--------|tail----|--------|--------|--------|--------|
|--------|--------|--------|headCach|--------|--------|--------|--------|

I played around with mixing off heap counters with on heap buffer but in the interest of brevity I'll summarize and say the JVM does not like that very much and the end result performance is not as good as all heap/off-heap solutions. The code is available with everything else.

Once alignment and memory layout were sorted I had to give up the flexibility of having reference pointers and settle for writing my data (an integer) directly into the memory. This leaves my queue very restrictive in its current form. I intend to revisit it and see what I can do to offer a more extendable API on top of it.

Let me summarize the recipe at this point:

Create a memory mapped file large enough to hold:

4 cache lines for counters/cached counters.
4 bytes(per integer) * queue capacity (must be a power of 2).
1 spare cache line to ensure you can align the above to the cache line.

Get a mapped byte buffer, which is a direct byte buffer on top of the mapped memory.
Steal the address and get the contained aligned byte buffer.
Set up pointers to the counters and the beginning of the buffer.
Replace use of natural counters with off heap counters accessed via Unsafe using the pointers.
Replace use of array with use of offset pointers into buffer and Unsafe access.
Test and debug until you work out the kinks...
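To make the recipe a bit more tangible, here is a heavily simplified sketch of the sizing and layout steps. It is NOT the implementation from the repository and it is not safe for concurrent use: it uses plain ByteBuffer reads and writes instead of Unsafe, it leaves out the cached counters, and it leans on the mapping starting on a page boundary (which is what I would expect from a memory mapped file) instead of stealing the address and aligning. All names are hypothetical and the 64 byte cache line is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: single-threaded, no memory ordering, hypothetical names.
final class MappedIntQueueSketch {
    static final int CACHE_LINE = 64;              // assumed cache line size
    static final int HEAD_OFFSET = 0 * CACHE_LINE; // consumer counter, own line
    static final int TAIL_OFFSET = 2 * CACHE_LINE; // producer counter, own line
    static final int DATA_OFFSET = 4 * CACHE_LINE; // first queue slot

    private final MappedByteBuffer buffer;
    private final int mask;

    MappedIntQueueSketch(Path file, int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of 2");
        }
        MappedByteBuffer mapped;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed.
            mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, requiredBytes(capacity));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        mapped.order(ByteOrder.nativeOrder());
        this.buffer = mapped;
        this.mask = capacity - 1;
    }

    // 4 counter lines + 4 bytes per int slot + 1 spare line for alignment.
    static long requiredBytes(int capacity) {
        return 4L * CACHE_LINE + 4L * capacity + CACHE_LINE;
    }

    // Writes the value inline into the next slot; false if the queue is full.
    boolean offer(int value) {
        long tail = buffer.getLong(TAIL_OFFSET);
        long head = buffer.getLong(HEAD_OFFSET);
        if (tail - head > mask) {
            return false;
        }
        buffer.putInt(DATA_OFFSET + 4 * (int) (tail & mask), value);
        buffer.putLong(TAIL_OFFSET, tail + 1);
        return true;
    }

    // Reads the oldest value, or returns null if the queue is empty.
    Integer poll() {
        long head = buffer.getLong(HEAD_OFFSET);
        if (head == buffer.getLong(TAIL_OFFSET)) {
            return null;
        }
        int value = buffer.getInt(DATA_OFFSET + 4 * (int) (head & mask));
        buffer.putLong(HEAD_OFFSET, head + 1);
        return value;
    }
}
```

The power of 2 capacity is what allows the counter to be turned into a slot index with a cheap mask rather than a modulo.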

The above code should give you a fair idea how it works out and the rest is here. This queue can work in process and out of process as demonstrated in the tests included in the repository. Now that it works (for the limited use case, and with room for further improvement... but works), is it fast enough? not so fast? is it...<gasp> ... FASTER????!?!?!

Smithers, release the hounds

Here are the numbers for using the different implementations in process:

Implementation/Affinity	Same core	Cross core	Cross socket
P1C1QueueOriginal3	110M	130M	19M
P1C1OffHeapQueue	130M	220M	200M
P1C1QueueOriginalPrimitive	124M	220M	215M

Confused? Let me explain. The first line is the measurements taken for the original queue. They are similar to what was presented in the previous post, though I saw a slight improvement in the results when increasing the compile threshold to 100000.
The second line is my off-heap implementation of the same algorithm. It is significantly faster. This is not IPC yet, this is in process. The reason it is faster is that the data is inlined in the queue, which means that by loading an entry in the queue we get the data as opposed to a reference to the data. Getting a reference is what you get when you have an Object[] array. The array holds the references and the data is elsewhere, and this seems to make it more painful as we get further from the producer.
The last entry is a mutation of P1C1QueueOriginal3 into a primitive array backed queue to compare performance like for like. As you can see this displays very similar results to the off heap implementation supporting the theory that data in-lining is behind the observed performance boost.
The lesson here is an old one, namely that pointer chasing is expensive business further amplified by the distance between the producing CPU and consuming CPU.
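To spell out the difference in data layout (not the performance, which this toy will not demonstrate), here is a minimal sketch with hypothetical names. With an Integer[] each slot holds a reference that has to be followed to reach the value, while with an int[] the slot is the value.

```java
// Illustrative only: reference slots versus inlined primitive slots.
final class InlineVersusReferenceSketch {
    // Each slot is a reference; reading the value means a second memory access.
    static long sumBoxed(Integer[] slots) {
        long sum = 0;
        for (Integer slot : slots) {
            sum += slot; // follow the reference, then unbox
        }
        return sum;
    }

    // Each slot is the value itself, laid out contiguously in the array.
    static long sumPrimitive(int[] slots) {
        long sum = 0;
        for (int slot : slots) {
            sum += slot;
        }
        return sum;
    }
}
```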
The off-heap queue can offer an alternative to native code integration as the consuming thread may interact directly with the off-heap queue and write results back to a different off-heap queue.
Running a similar benchmark adapted to use a memory mapped file as the backing DirectByteBuffer for the off-heap queue we get:
    same core - ops/sec=135M
    across cores - ops/sec=98M
    across sockets - ops/sec=25M

JOY! A pure Java IPC that gives you 135M messages per second is more throughput than you'd get with most commercial products out there. This is still not as fast as the same queue in process and I admit I'm not sure what the source of the performance difference is. Still I am quite happy with it.
A few notes/observations from the experimentation process:

I got a variety of results, stabilizing around different average throughputs. I chose the best for the above summary and plan to go into detail about the results in the near future.
The JVM was launched with: -XX:+UseCondCardMark -XX:CompileThreshold=100000
Removing the Thread.yield from the producer/consumer loops improved performance when running on the same core, but made it worse otherwise.
Moving the queue allocation into the test loop changes the performance profile dramatically.
I've not had time to fully explore the size of the queue as a variable in the experiment but the little I've done suggests it makes a difference, choose the right size for your application.

I realize this post is rather less accessible than the previous one, so if you have any questions please ask.