Friday, 28 March 2014

Java Object Layout: A Tale Of Confusion

Buy the T-shirt
Following the twists and turns of the conversation on this thread in the Mechanical Sympathy mailing list highlights how hard it is to reason about object layout based on remembered rules. Mostly every person on the thread is right, but not completely right. Here's how it went...


The thread discusses False Sharing as described here. It is pointed out that the padding is one sided and padding using inheritance is demonstrated as solution. The merits of using inheritance vs. using an array and utilizing Unsafe to access middle element (see Disruptor's Sequence) vs. using AtomicLongArray to achieve the same effect are discussed (I think the inheritance option is best, as explored here). And then confusion erupts...

What's my layout?

At this point Peter L. makes the following point:
[...]in fact the best option may be.
class Client1 {
    private long value;
    public long[] padding = new long[5];
What follows is a series of suggestions on what the layout of this class may be.

Option 1: My recollections

I was too lazy to check and from memory penned the following:
[...] the ordering of the fields may end up shifting the reference (if it's 32bit or CompressedOop) next to the header to fill the pad required for the value field. The end layout may well be:
12b header
4b padding(oop)
8b value

Option 2: Simon's recollections

Simon B. replied:
[...] I thought Hotspot was laying out longs before
references, and that the object header was 8 Bytes.
So I would expect Client1 to be laid out in this way:
8B header
8B value
4B padding (oop)
[...] Am I out of date in object layout and header size ?

Option 3: TM Jee's doubts

Mr. Jee slightly changed the class and continued:
class Client1 {
    private long value;
    public long[] padding = new long[5]
    public Object[] o = new Object[1];
the memory layout should be something like
12b header (or is it 16b)
8b value
4b for the long[] (its just the reference which is 4b for compressed and 8b if not)
4b for the Object[] (again it's just the reference)
Is this right so far?
To which  Peter L. wisely replied:
Yes. But as you recognise the sizes of the header and sizes of references are not known until runtime.

Option 4: Check...

So I used JOL to check. And as it turns out we are all somewhat right and somewhat wrong...
I'm right for compressed oops (the default for 64bit):
Running 64-bit HotSpot VM.
Using compressed references with 3-bit shift.
Client1 object internals:
      0     4          (object header)
      4     4          (object header)
      8     4          (object header)
     12     4   long[] Client1.padding
     16     8     long Client1.value
     24     4 Object[] Client1.o
     28     4          (loss due to the next object alignment)

The header is 12b and the array reference is shifted up to save on space. But my casual assumption 32bit JVM layout will be the same is wrong.

Simon is right that the header is 8b (but only for 32bit JVMs) and that references will go at the end (for both 32bit and 64bit, but not with compressed oops):
Running 32-bit HotSpot VM.
Client1 object internals:
      0     4          (object header)
      4     4          (object header)
      8     8     long Client1.value
     16     4   long[] Client1.padding
     20     4 Object[] Client1.o

And finally with 64bit Mr. Jee is right too:
Running 64-bit HotSpot VM.
Client1 object internals:
      0     4          (object header)
      4     4          (object header)
      8     4          (object header)
     12     4          (object header)
     16     8     long Client1.value
     24     8   long[] Client1.padding
     32     8 Object[] Client1.o

And Peter is entirely right to point out the runtime is the crucial variable in this equation.


If you catch yourself wondering about object layout:
  1. Use JOL to check, it's better than memorizing rules
  2. Remember that 32/64/64+Oops are different for Hotspot, and other JVMs may have different layouts altogether
  3. Read another post about java memory layout

Wednesday, 26 March 2014

Where is my safepoint?

My new job (at Azul Systems) leads me to look at JIT compiler generated assembly quite a bit. I enjoy it despite, or perhaps because, the amount of time I spend scratching my increasingly balding cranium in search of meaning. On one of these exploratory rummages I found a nicely annotated line in the Zing (the Azul JVM) generated assembly:
gs:cmp4i [0x40 tls._please_self_suspend],0jnz 0x500a0186
Zing is such a lady of a JVM, always minding her Ps and Qs! But why is self suspending a good thing?

Safepoints and Checkpoints

There are a few posts out there on what is a safepoint (here's a nice one going into when it happens, and here is a long quote from Mechnical Sympthy mailing list on the topic). Here's the HotSpot glossary entry:
A point during program execution at which all GC roots are known and all heap object contents are consistent. From a global point of view, all threads must block at a safepoint before the GC can run. (As a special case, threads running JNI code can continue to run, because they use only handles. During a safepoint they must block instead of loading the contents of the handle.) From a local point of view, a safepoint is a distinguished point in a block of code where the executing thread may block for the GC. Most call sites qualify as safepoints. There are strong invariants which hold true at every safepoint, which may be disregarded at non-safepoints. 
To summarize, a safepoint is a known state of the JVM. Many operations the JVM needs to do happen only at safepoints. The OpenJDK safepoints are global, while Zing has a thread level safepoint called a checkpoint. The thing about them is that at a safepoint/checkpoint your code must volunteer to be suspended to allow the JVM to capitalize on this known state.
What will happen while you get suspended varies. Objects may move in memory, classes may get unloaded, code will be optimized or deoptimized, biased locks will unbias.... or maybe your JVM will just chill for a bit and catch its breath. At some point you'll get your CPU back and get on with whatever you were doing.
This will not happen often, but it can happen which is why the JVM makes sure you are never too far from a safepoint  and voluntary suspension. The above instruction from Zing's generated assembly of my code is simply that check. This is called safepoint polling.
The safepoint polling mechanism for Zing is comparing a thread local flag with 0. The comparison is harmless as long as the checkpoint flag is 0, but if the flag is set to 1 it will trigger a checkpoint call (the JNZ following the CMP4i will take us there) for the particular thread. This is key to Zing's pause-less GC algorithm as application threads are allowed to operate independently.

Reader Safpoint

Having happily grokked all of the above I went looking for the OpenJDK safepoint.

Oracle/OpenJDK Safepoints

I was hoping for something equally polite in the assembly output from Oracle, but no such luck. Beautifully annotated though the Oracle assembly output is when it comes to your code, it maintains some opaqueness when it's internals are concerned. After some digging I found this:
test   DWORD PTR [rip+0xa2b0966],eax        # 0x00007fd7f7327000
                                                ;   {poll}
No 'please', but still a safepoint poll. The OpenJDK mechanism for safepoint polling is by accessing a page that is protected when requiring suspension at a safepoint, and unprotected otherwise. Accessing a
protected page will cause a SEGV (think exception) which the JVM will handle (nice explanation here). To quote from the excellent Alexey Ragozin blog:
Safepoint status check itself is implemented in very cunning way. Normal memory variable check would require expensive memory barriers. Though, safepoint check is implemented as memory reads a barrier. Then safepoint is required, JVM unmaps page with that address provoking page fault on application thread (which is handled by JVM’s handler). This way, HotSpot maintains its JITed code CPU pipeline friendly, yet ensures correct memory semantic (page unmap is forcing memory barrier to processing cores).
The [rip+0xa2b0966] addressing is a way to save on space when storing the page address in the assembly code. The address commented on the right is the actual page address, and is equal to the rip (Relative Instruction Pointer) + given constant. This saves space as the constant is much smaller than the full address representation. I thank Mr. Tene for clarifying that one up for me.
If we were to look at safepoints throughout the assembly of the same process they would all follow the above pattern of pointing at the same global magic address (via this local relative trick). Setting the magic page to protected will trigger the SEGV for ALL threads. Note that the Time To Safe Point (TTSP) is not reported as GC time and may prove a hidden performance killer for your application. The effective cost of this global safepoint approach goes up the more runnable (and scheduled) threads your application has (all threads must wait for a safepoint consensus before the operation to be carried out at the safepoint can start).

Find The Safpoint Summary

In short, when looking for safepoints in Oracle/OpenJDK assembly search for poll. When looking at Zing assembly search for _please_self_suspend.

Tuesday, 25 February 2014

Unsafe Pointer Chasing: Running With Scissors

Love running? Love scissors? I know just the thing for you! Following on from recent discussion on the Mechanical Sympathy mailing list I see an anti pattern worth correcting in the way people use Unsafe. I say correcting as I doubt people are going to stop, so they might as well be made aware of the pitfalls. This pattern boils down to a classic concurrency bug:

Q: "But... I not be doing no concurrency or nuffin' guv"
A: Using Unsafe to gain a view of on-heap addresses is concurrent access by definition.

Unsafe address: What is it good for?

Absolutely nothing! sayitagain-huh! I exaggerate, if it was good for nothing it would not be there, let's look at the friggin manual:
As we can see the behaviour is only defined if we use the methods together, and by that I mean that get/putAddress are only useful when used with an address that is within a block of memory allocated by allocateMemory. Now undefined is an important word here. It means it might work some of the time... or it might not... or it might crash your VM. Let's think about this.

Q: What type of addresses are produced by allocateMemory?
A: Off-Heap memory addresses -> unmanaged memory, not touched by GC or any other JVM processes

The off-heap addresses are stable from the VM point of view. It has no intention of running around changing them, once allocated they are all yours to manage and if you cut your fingers in the process or not is completely in your control, this is why the behaviour is defined. On-Heap addresses on the other hand are a different story.

Playing With Fire: Converting An Object Ref to An Address

So imagine you just had to know the actual memory address of a given instance... perhaps you just can't resist a good dig under the hood, or maybe you are concerned about memory layout... Here's how you'd go about it:
Now... you'll notice the object ref needs a bit of cuddling to turn into an address. Did I come up with such devilishly clever code myself? No... I will divulge a pro-tip here:
If you are going to scratch around the underbelly of the JVM, learn from as close to the JVM as you can -> from the JDK classes, or failing that, from an OpenJDK project like JOL (another Shipilev production)
In fact, the above code could be re-written to:
Now that we have the address what can we do with it? Could we use it to copy the object? maybe we could read or modify the object state? NO! we can but admire it's numerical beauty and muse on the temperamental values waiting at the other end of that address. The value at the other end of this address may have already been moved by GC...

Key Point: On-Heap Addresses Are NOT Stable

Consider the fact that at any time your code may be paused and the whole heap can be moved around... any address value you had which pointed to the heap is now pointing to a location holding data which may be trashed/outdated/wrong and using that data will lead to a funky result indeed. Also consider that this applies to class metadata or any other internal accounting managed by the JVM.
If you are keen to use Unsafe in the heap, use object references, not addresses. I would urge you not to mix the 2 together (i.e. have object references to off-heap memory) as that can easily lead to a very confused GC trying to chase references into the unknown and crashing your VM.

Case Study: SizeOf an Object (Don't do this)

This dazzling fit of hackery cropped up first (to my knowledge) here on the HighScalability blog:
This is some sweet macheta swinging action :-). The dude who wrote this is not suggesting it is safe, and only claims it is correct on a 32bit VM. And indeed, it can work and passes cursory examination. The author also states correctly that this will not work for arrays and that with some corrections this can be made to work for 64 bit JVMs as well. I'm not going to try and fix it for 64 bit JVMs, though most of the work is already done in the JOL code above. The one flaw in this code that cannot be reliably fixed is that it relies on the native Klass address (line 6) to remain valid long enough for it to chase the pointer through to read the layout helper (line 8). Spot the similarity to the volatile bug above?
This same post demonstrates how to forge references from on-heap objects to off-heap 'objects' which in effect let you cast a pointer to a native reference to an object. It goes on to state that is a BAD IDEA, and indeed it can easily crash your VM when GC comes a knocking (but it might not, I didn't try).

Case Study: Shallow Off-Heap Object Copy (Don't do this)

Consider the following method of making an off-heap copy of an object (from here, Mishadof's blog):
We see the above is using the exact same method for computing size as demonstrated above. It's getting the on-heap object address (limited correctness, see addresses discussion above) than copying the object off-heap and reading it back as a new object copy... Calling the Unsafe.copyMemory(srcAddress, destAddress, length) is inviting the same concurrency bug discussed above. A similar method is demonstrated in the HighScalability post, but  there the copy method used is Unsafe.copyMemory(srcRef, srcOffset, destRef, destOffset, length). This is important as the reference using method is not exposed to the same concurrency issue.
Both are playing with fire ofcourse by converting off-heap memory to objects. Imagine this scenario:
  • a copy of object A is made which refers to another object B, the copy is presented as object C
  • object A is de-referenced leading to A and B being collected in the next GC cycle
  • object C is still storing a stale reference to B which is no managed by the VM
What will happen if we read that stale reference? I've seen the VM crash in similar cases, but it might just give you back some garbage values, or let you silently corrupt some other instance state... oh, the fun you will have chasing that bugger down...


I don't mean to present either of the above post authors as fools, they are certainly clever and have presented interested findings for their readers to contemplate without pretending their readers should run along and build on their samples. I have personally commented on some of the code on Mishadof's post and admit my comments were incomplete in identifying the issues discussed above. If anything I aim to highlight that this hidden concurrency aspect can catch out even the clever.
Finally, I would be a hypocrite if I told people not to use Unsafe, I end up using it myself for all sorts of things. But as Mr. Maker keeps telling us "Be careful, because scissors are sharp!"

Thursday, 13 February 2014

When I say final, I mean FINAL!

Having recently bitched about the lack of treatment of final field as final I was urged by Mr. Shipilev to demonstrate the issue in a more structured way (as opposed to a drunken slurred rant), I have now recovered my senses to do just that. The benchmark being run and the queue being discussed are covered in this post, so please refresh you memory for context if you need. The point is clear enough without full understanding of the context though.
It is perhaps a fact well known to those who know it well that final fields, while providing memory visibility guarantees, are not actually immutable. One can always use reflection, or Unsafe, to store new values into those fields, and in fact many people do (and Cliff Click hates them and wishes them many nasty things). This is (I believe) the reason behind some seemingly trivial optimizations not being done by the JIT compiler.

Code Under Test: FFBufferWithOfferBatch.poll()

The buffer field is a final field of FFBufferWithOfferBatch and is being accessed twice in the method above. A trivial optimization on the JIT compiler side would be to load it once into a register and reuse the value. It is 'immutable' after all. But if we look at the generate assembly (here's how to, I also took the opportunity to try out JITWatch which is brilliant):
We can see buffer is getting loaded twice (line 15, and again at line 24). Why doesn't JIT do the optimization? I'm not sure... it may be due to the volatile load forcing a load order that could in theory require the 'new' value in buffer to be made visible... I don't know.

Hack around it, see if it makes a difference

Is that a big deal? Let's find out. The fix is trivial:
And the assembly code generated demonstrates the right behaviour now (one load at line 15):
Now, was that so hard to do? And more importantly, does it make any difference to performance? As discussed previously, the throughput benchmark is sensitive to changes in the cost balance between offer/poll. The optimization creates an interesting change in the pattern of the results:
The benchmark is run on Ubuntu13.10/JDK7u45/i7@2.4, the x axis is the index of the benchmark run and the Y axis is the result in ops/sec. The chart displays the results for before the change (B-*) and after(A-*) with different sparse data settings. We can see the change has accelerated the consumer, leading to increased benefit from sparse data that was not visible before. With sparse data set to 1 the optimization results in a 2% increase in performance. Not mind blowing, but still. The same change applied to the producer thread loop (localizing the reference to the queue field) discussed in the previous post enabled a 10% difference in performance as the field reference stopped the loop from unrolling and was read on each iteration. I used the poll() example here because it involves allot less assembly code to wade through.

Hopefully this illustrates the issue to Mr. Shipilev's content. Thanks goes to Gil Tene for pointing out the optimization to me and to Chris Newland for JITWatch.

Tuesday, 28 January 2014

Picking the 2013 SPSC queue champion: benchmarking woes

I put FFBufferWithOfferBatch together as part of looking into the performance of BQueue and what with all the excitement about JAQ I never introduced it properly. As it's the best performing SPSC(Single Producer Single Consumer) queue I have thus far I thought I should quickly cover it and tell you some horrific benchmarking tales of how I determined it's the best thus far.

Offer like a BQueue

The BQueue was previously presented and discussed here. Offer logic is simple and improves on the original FFBuffer by ensuring clearance to write ahead for the next OFFER_BATCH_SIZE elements. This improvement has 3 effects:
  1. Most of the time we can write without reading and checking for null. We do one volatile read for every OFFER_BATCH_SIZE elements, the rest of the time we compare to the cached field which is a normal field that can be held in a register.
  2. We avoid hitting the full queue induced contention by keeping a certain distance from the head of the queue. Contention can still happen, but it should be quite rare.
  3. We potentially miss out on part of the queue capacity as we will declare the queue full if there are less than OFFER_BATCH_SIZE available when we hit the cached value.
Overall this is not a bad deal.

Poll like an FFBuffer

This poll method is again quite simple. Offset hides the sparse data shift trick that is discussed in depth here. The sparse data trick is a way of reducing contention when nearing the empty queue state. With a sparse shift of 2 only 4 elements in the queue buffer actually share the same cache line (as opposed to 16 when data is not sparse) reducing the frequency with which that state occurs. This has a side effect of hurting memory throughput as each cache line we load has less data in it.

Benchmarks: A Misery Wrapped In An Enema

I've been benchmarking these queues over a long period of time and it is a learning experience (i.e: very painful). Benchmarking is just generally a complex matter, but when it comes to examining and comparing code where a single operation (offer/poll) is in the nano-seconds I quote Mr. Shipilev: "nano-benchmarks are the damned beasts". Now, consider a multi-threaded nano-benchmark...
Let's start with naming the contestants:
Now, the first benchmark I used to quantify the performance of these queues was the one I got from Martin's examples project. I stuck with it for consistency (some cosmetic changes perhaps), though I had some reservations about the section under measurement. Here's how the measured section goes:

The benchmark measures the time it takes for REPETITIONS number of 'messages' to be sent from the producer to the consumer. If the consumer find that the queue is empty it will Thread.yield and try again, similarly if the producer finds that the queue is full it will Thread.yield and try again.
Here's how the queues perform with that benchmark (running on Ubuntu13.10/JDK7u45/i7-4300MQ@2.4 no turbo boost, pinned across cores using taskset, with JVM options: '-server -XX:+UseCondCardMark -XX:CompileThreshold=100000'):
  CAQ(sparse=0) - 130M ops/sec
  FF1(sparse=0) - 155M ops/sec
  FF2(sparse=0) - 185M ops/sec
  CAQ(sparse=2) - 288M ops/sec
  FF1(sparse=2) - 282M ops/sec
  FF2(sparse=2) - 290M ops/sec

When I started on JAQ I took the same benchmark, but changed a few things. I moved the start timestamp to the producing thread and moved the thread join out of the measured section, I also added some counters for when the producer fails to offer (queue is full) and the consumer fails to poll (queue is empty). Finally I took this same benchmark and ported it to use my very own ConcurrentQueue interface. Here's how it looked:

You would think this sort of change could only make performance slightly worse, if it made any difference at all. Certainly that's what I thought, but I was wrong... With the new benchmark I got the following results:
  FF2(sparse=0) - 345M ops/sec
  FF2(sparse=2) - 327M ops/sec

WTF? What is the difference?

I looked long and hard at the benchmark and realized the only difference, that shouldn't really make a difference, but the only difference that does make a difference is the localization of the queue field to the producer loop (found by process of elimination). Tweaking the original benchmark to localize the queue reference gives us different results for all queues:
  CAQ(sparse=0) - 291M ops/sec
  FF1(sparse=0) - 190M ops/sec
  FF2(sparse=0) - 348M ops/sec
  CAQ(sparse=2) - 303M ops/sec
  FF1(sparse=2) - 287M ops/sec
  FF2(sparse=2) - 330M ops/sec

So we notice things have not improved much for FF1, but for CAQ we now have only a marginal difference between using sparse data and not using it, and for FF2 it is actually better not to bother at all with sparse data. The localization of the queue reference made the producer faster, reducing the number of times the empty queue state is hit and thus reducing the need for sparse data. We can try to validate this claim by running the Queue variation of this benchmark with the counters. With the reference localized:
  CAQ(sparse=0) - 291M ops/sec, poll fails 350,    offer fails 0
  FF1(sparse=0) - 192M ops/sec, poll fails 32000,  offer fails 0
  FF2(sparse=0) - 348M ops/sec, poll fails 150,    offer fails 13000
  CAQ(sparse=2) - 303M ops/sec, poll fails 270,    offer fails 0
  FF1(sparse=2) - 287M ops/sec, poll fails 200,    offer fails 0
  FF2(sparse=2) - 330M ops/sec, poll fails 170,    offer fails 10

So we can see that adding the extra counters made little difference with the reference localized, but when referencing the field directly in the producer loop:
  CAQ(sparse=0) - 167M ops/sec, poll fails 2400,   offer fails 0
  FF1(sparse=0) - 160M ops/sec, poll fails 100000, offer fails 0
  FF2(sparse=0) - 220M ops/sec, poll fails 2000,   offer fails 0
  CAQ(sparse=2) - 164M ops/sec, poll fails 31000,  offer fails 0
  FF1(sparse=2) - 250M ops/sec, poll fails 2000,   offer fails 0
  FF2(sparse=2) - 255M ops/sec, poll fails 500,    offer fails 0

We get a different picture again, in particular for CAQ. To get to the bottom of the exact effect localizing the reference had on this benchmark I will have to dig into the generated assembly. This will have to wait for another post...

Conclusions And Confusions

The overall winner, with all the different variations on the throughput benchmarks, is the FFBufferWithOfferBatch. It's also the winner of the previously presented/discussed latency benchmarks(part 1, part 2). With turbo boost on it hits a remarkable high of 470M ops/sec. But setting this result to one side, the above results highlight a flaw in the throughput benchmark that is worth considering in the context of other concurrent benchmarks, namely that changing the [queue] implementation under test can change the benchmark.
Let me elaborate on this last point. What I read into the results above is that the benchmark was initially a benchmark where the queue full state was never hit, and the queue empty state was hit more or less depending on the queue implementation. Given FF2 is only a change to the offer method of FF1 we can see how tilting the balance between the offer cost and poll cost changed the nature of the test. In particular when using no sparse data the producer turned out to be significantly faster than the consumer. But... in changing the implementation we have unbalanced the benchmark, it is hardly ever hitting an empty queue, which would slow down the consumer/producer threads. We have switched to measuring the full speed of consumption as a bottleneck, but only for one queue implementation. So this is, in a way, not the same benchmark for all queues.
While this benchmark is still useful to my mind, it is worth considering that it is in fact benchmarking different use cases for different queues and that the behaviour of the using code will inevitably balance the producer/consumer costs quite differently. All in all, yet another case of your mileage may vary ;-)

Thanks to Peter, Darach and Mr. Gomez

Thursday, 16 January 2014

JAQ: Using JMH to Benchmark SPSC Queues Latency - Part II

Just quickly sharing more results/insights from running different configuration of the SPSC latency benchmark discussed in the previous post. The previous post reviewed the different implementations behaviour when sending different sizes of bursts, in this post I'll have a look at the impact of the length of the chain (number of threads passing the messages from one to the other returning to the originator) on latency. The benchmarks and queues stay the same, so you may have to skip back to the previous post for context.


This post may not be of great popular appeal... it's a verification of the behaviour of current implementations of SPSC in terms of single threaded operation costs and RTT latency for different 'trip' lengths. The main finding of interest (for me) is that the sparse data method previously discussed here is proving to be a hindrance to latency performance. This finding is discussed briefly at the end and is something I may dig into more deeply in the near future. Other more minor nuggets were found... but anyhow, lets get to it. 

Chain length 1: The cost of talking to yourself

When I had Darach review the original post he pointed out to me that I neglected to cover this edge case in which a queue is being used by a single thread acting as both the consumer and the producer. The result is interesting as a baseline of operation costs when no concurrency is involved. The benchmark is very simple:

The results give us a sense of the cost of [(offer * burst size) + (poll * burst size)] for the different implementations. I added one of my favourite collections, the ArrayDequeue, to compare the efficiency of a non-thread safe queue implementation with my furry gremlins. Here's the score for AD vs CLQ for the different burst sizes:

This is a bit of a bland presentation, so feel free to call me names (you bland representor of numbers!). The numbers are here if you want to play. All the numbers are in nanos per op, mean error in brackets. Benchmark run on laptop(Ubuntu13.10/JDK7u45/i7-4700MQ@2.4 capped). Here's the score for CAQ/FF2 with sparse shift set to 0 and 2:

Points for consideration:
  • CLQ (ConcurrentLinkedQueue) is the worst. This is completely the wrong tool for the job which is why this result should be a lesson in why using concurrent data structures where you don't need them will suck.
  • Overall AD(ArrayDequeue) is the best from burst size 4 and up. This is the right tool for the job when your consumer and producer are the same thread. Choose the right tool, get the right result.
  • CAQ(ConcurrentArrayQueue) and FF2(FFBufferWithOfferBatch) start off as similar or better than AD, but soon fall behind. Interestingly FF2 starts as the winner of the 3 and ends as the loser with CAQ overtaking it from burst size 64 and onward.

I found it interesting to see how the behaviour changed as the burst size changes, and I thought the single threaded cost estimates were valuable as enablers of other estimates, but overall this is not a benchmark to lose sleep over.
Note the CAQ/FF2 are measured in 2 configurations with and without sparse data. The use of sparse data is counter productive in this use case as it aims to reduce contention that is not there, and inhibits throughput in the process.

Stable Chains: Working across the cores

Due to the limits of my benchmarking machine I can only demonstrate stable behaviour for chains of length 1 to 4 without resorting to pinning threads to cores. Still, the results offer some insight. The benchmark in use is the one discussed previously here. The whole process is pinned such that the only available inter-thread channel is across cores. On my machine I have a hyper-threaded quad-core CPU so 8 logical cores from a taskset perspective. The even/odd numbers are on different cores. For my runs I pinned all other processes to core 0 and pinned the JMH process to different cores (as many as I needed from 1,3,5,7). This is not ideal, but is workable.
I didn't bother with CLQ for this set of results. Just CAQ and FF2. The benchmark code is here.
Here's chain length 2 (T1 -> T2 -> T1):

Here's chain length 3 (T1 -> T2 -> T3 -> T1):

Here's chain length 4 (T1 -> T2 -> T3 -> T4 -> T1):

This is data collected from (10 warmup iterations + 10 iterations) * 10 forks, so benchmarking took a while. This should mean the data is slightly more than anecdotal...

  • FF2 is consistent proving to be the best queue in terms of inter-thread latency. This advantage is maintained for all burst sizes and chain lengths.
  • Sparse data is only marginally improving results consistently for burst size 1. After that it seems to have a small positive effect on CAQ with chain length 3/4 with small bursts. This may be a quirk of the use case demonstrated by the benchmark but it at least shows this method has a demonstrable down side.
  • The burst RTT does not grow linearly with the burst size, in particular for the smaller bursts. This is due to the bursts filling up the time-to-notify period with write activity.
Here's the latency for FF2(sparse=0) across all chains:

  • Single threaded cost grows linearly with size, this is how we are used to think about cost. It's also a lower boundary for the round trip use case as the round trip requires at least the amount of work done in the single thread case to still be done in the measuring/source thread.
  • Once we increase the chain length we get this initial plateau steadily increasing in slope to become linear again. Note that for the 2 threaded case the costs are nearly the same as the single thread case from burst size 512.

Where to next?

I've had a long break over Xmas, some progress was made on JAQ, but not as much as I'd have liked... I actually spent most of the time away from a keyboard. There are now MPSC/SPMC implementations, and a direct ConcurrentQueue implementation for SPSC hooked into the factory and an MPSC one nearing completion. I've had some interest from different people and am working towards meeting their requirements/needs... we'll get there eventually :-)

Thanks Norman M. for his review (he did tell me to add graphs... my bad)

Friday, 13 December 2013

JAQ: Using JMH to Benchmark SPSC Queues Latency - Part I

As part of the whole JAQ venture I decided to use the excellent JMH (Java Microbenchmark Harness see: project here, introduction to JMH here, and multi-threaded benchmarking here). The reason I chose JMH is because I believe hand rolled benchmarks limit our ability to peer review and picking a solid framework allows me and collaborators to focus on the experiment rather than the validity of the harness. One of the deciding factors for me was JMHs focus on multi-threaded benchmarks, which is something I've not seen done by other frameworks and which forms it's own set of problems to the aspiring benchmarker. It's not perfect, but it's the closest we have.
Anyways... that's the choice for now, lets jump into some benchmarks!

Using the JMH Control

To benchmark the latency of 2 threads communicating via a queue requires using the JMH Control structure. This is demonstrated in the JMH samples measuring thread to thread "ping pong" via an AtomicBoolean.compareAndSwap (see original here):

The challenge here is that the CAS operation may fail, and when it does we would want to retry until we succeed, BUT if the opposite thread were to be stopped/suspended because the benchmark iteration is done we would be stuck in an infinite loop. To avoid this issue a benchmarked method can have a Control parameter passed in which exposes a flag indicating the end of a run, this flag can be used to determine benchmark termination and thus avoid the above issue. You'll notice other JMH annotations are in use above:
  • State: sets the scope of the benchmark state, in this case the Group
  • Group: multi-threaded benchmarks can utilize thread groups of different sizes to exercise concurrent data structure with different operations. In the case above we see one group calls pingpong. Both methods are exercised by threads from the group, using the default of 1 thread per operation.
  • Note that if we were to run the above benchmark with more threads, we will be running several independent groups. Running with 10 threads will result in 5 concurrent benchmark runs each with it's own group of 2 threads.
I pretty much copied this benchmark into the jaq-benchmarks and refer to it as a baseline for inter-thread message exchange latency. The results for this benchmark are as follows (running on the same core):
Benchmark                                  Mode Thr     Count  Sec         Mean   Mean error    Units
i.j.s.l.BaselinePingPong.pingpong          avgt   2        10    1       23.578        0.198    ns/op
i.j.s.l.BaselinePingPong.pingpong:ping     avgt   2        10    1       23.578        0.198    ns/op
i.j.s.l.BaselinePingPong.pingpong:pong     avgt   2        10    1       23.578        0.198    ns/op

And running across different cores:
Benchmark                                  Mode Thr     Count  Sec         Mean   Mean error    Units
i.j.s.l.BaselinePingPong.pingpong          avgt   2        10    1       60.508        0.217    ns/op
i.j.s.l.BaselinePingPong.pingpong:ping     avgt   2        10    1       60.496        0.195    ns/op
i.j.s.l.BaselinePingPong.pingpong:pong     avgt   2        10    1       60.520        0.248    ns/op

The way I read it CAS is sort of a round trip to memory, reading and comparing the current value, and setting it to a new value. You could argue that it's only a one way trip, and I can see merit in that argument as well. The point is, these are the ballpark figures. We shouldn't be able to beat this score, but if we manage to get close we are doing well.
Note from a hardware point of view: See this answer on Intel forums on cache latencies. Latencies stand at roughly:
"L1 CACHE hit, ~4 cycles
 L2 CACHE hit, ~10 cycles
 L3 CACHE hit, line unshared ~40 cycles
 L3 CACHE hit, shared line in another core ~65 cycles
 L3 CACHE hit, modified in another core ~75 cycles
 remote L3 CACHE ~100-300 cycles
 Local Dram ~60 ns
 Remote Dram ~100 ns"
To convert cycles to time you need to divide by your CPU clock speed, the delay manifestation will also be dependant on whether or not you require the data to make progress etc. The thing to keep in mind here is that there is a physical limit to the achievable latency. If you think you're going faster... check again ;-). 

Queue round trip latency benchmark

Queues are not about exchanging a single message, they are about exchanging many messages asynchronously (at least that's what the queues I care about are about). Exchanging a single message is a worst case scenario for a queue, there's no gain to be had by batching, no false sharing to avoid, nothing much to cache, very little cost to amortize. It is therefore a valuable edge case, and an important one for many real world applications where latency needs to be minimized for bursty traffic/load. To put my queues to the test I devised a round trip benchmark which allows us to chain a configurable number of threads into a ring and measure the round trip latency. This should help magnify the exchange costs and perhaps expose behaviours not so easily detected in the minimal case. Here's an outline of the functionality under test given a chain length N (which requires N queues and N threads including the measuring thread) and a burst of K messages:
  1. From thread[0] send K objects down queue[0].
  2. Concurrently: In each thread K(0<K<N) read from queue[K-1] and write to queue[K].
  3. Thread[0] waits for K objects to be polled from queue[N-1] and returns.
There were several challenges/intricacies to be considered here:
  • The number of benchmarking threads is unknown at runtime (in JMH), so chain length must be defined separately.
  • Setup/Teardown methods on the Iteration level are called per thread.
  • Thread state is allocated and used per thread per iteration. The threads are pooled, so a given thread can switch 'roles' on each iteration. In other words @State(Scope.Thread) is not ThreadLocal.
  • Concurrent polls to SPSC queues can leave the queue in an undefined state.
  • The benchmark requires that the queues are empty when we start measuring, so at the beginning or end of an iteration we must clear the queues.
  • Queues must be cleared from the consumer thread.
So things are not that simple, and the code (full code here) is an attempt to cover all these intricacies:

I'm relying on correct usage for some things, but JMH is basically doing most of the heavy lifting:
  • JMH is taking care of all the measurements/measurement modes/reporting/formatting including forked runs and stats summaries and all.
  • JMH is taking care of thread management, no need to spin up threads/executors and shut them down.
  • JMH is synchronizing iterations and managing per thread state setup/teardown. Note the @TearDown annotation in Link/Source.
  • JMH is allowing the group thread balance to be parametrized (just recently) which allows generic asymmetric multi-threaded benchmarks like the above.
Where I have to work around the framework:
  • I would like to only specify the chain.length and derive the thread allocation from that. This is possible via the JMH launcher API.
  • I would like @State(Scope.Thread) to mean ThreadLocal.
  • I would like to be able to pass State down the stack as constructor parameters, so I could avoid the static field for sharing benchmark state between thread state objects.
I'd be lying if I said writing this benchmark the JMH way was easier than hacking something together, it take getting used to and some of the features I required are fairly late additions. But the benefit is worth it.
I can't stress this point enough, the objective here is more than collecting data, it is about enabling, supporting and empowering peer review. I want the benchmarks to be as simple as they can be so people can either be easily convinced or easily point out my errors. The less code I need my peers to read, the easier it is for anyone to build/run/tweak the benchmarks, the less we rely on handrolled/halfassed attempts at benchmark frameworks THE BETTER (having said that, you got to do what you got to do). 
Note: If you care about latency you should watch Gil Tene's excellent talk on How Not To Measure Latency, where he discusses coordinated omission and it's impact on latency measurement. As you may have noticed this experiment suffers from coordinated omission.  I'll have to live with that for now, but I don't feel the data gained from a corrected measurement would impact the queue implementations in any form.

Hard Numbers? Numbers Hard?

We've got the following queues:
  1. ConcurrentLinkedQueue (CLQ) - Good old JDK lock free queue, for reference. 
  2. InlinedCountersSpscConcurrentArrayQueue (CAQ) - My inlined counter variation on Martin Thompson's ConcurrentArrayQueue previously discussed/explained/explored here and compared to this variation here. It can be run with sparse data as described here.
  3. FFBuffer (FF1) - A Java port of the FastFlow SPSC implementation with sparse data as described here.
  4. BQueue (BQ)- An SPSC previously discussed here which introduces an offer side batch of the null check (which also serves as a full queue induced false sharing protection) and a poll side batching of emptiness testing which also includes a spinning backoff to avoid the near empty queue issue.
  5. FFBufferWithOfferBatch (FF2) - A BQueue on the offer side and an FFBuffer on the poll side, this is a new addition to the arsenal.
JMH gives us 2 relevant benchmark modes for the above use case:
  1. AverageTime - Average time per operation (from the command line use '-bm avgt')
  2. SampleTime - Time distribution, percentile estimation (from the command line use '-bm sample')
And we could also look at the same core vs. cross core latency, vary the length of the chain and examine the effect of sparse data on latency. This is a bit of an explosion of data, too much for one post, but lets dip our toes...
I'll start with the average RTT latency for each type, running same core and cross core:

The benchmarks were run on my laptop (JDK7u45/Ubuntu13.10/i7-4700MQ@2.40GHz no turbo boost/benchmark process pinned using taskset) and though I ran them repeatedly I didn't systematically collect run to run variance data, so consider the results subject to some potential run to run variance.
As we can see:
  • FF2/FF1 are very similar and take the cake at 60/230ns (FF2 is marginally faster).
  • BQ is the worst [392/936ns] because of its backoff on approaching near empty strategy.
  • It is perhaps surprising to see that although CAQ/FF1/BQ show very similar throughput in my previous tests their latency characteristics are so distinct.
  • Note that CLQ is actually not too bad latency wise (in this particular use case). 
Lets have a look at what happen when the burst size increases[I dropped FF1 out of the race here, too similar to FF2, this is running across cores]:
We can see that:
  • CLQ makes no effort to amortize costs as its RTT grows with the burst size.
  • BQ starts off as the worst, but somewhere between burst size 16 and 32 the benefits of it's design start to show.
  • FF2 and CAQ demonstrate similar growth patterns with the cost per item dropping as the burst size grows. But FF2 remains the clear winner, at burst size 32 the round trip is 640ns.
  • 640ns for 32 messages averages at just 20ns per message. This is NOT A MEANINGFUL LATENCY NUMBER! :) 

Latency != Time/Throughput

Let me elaborate on this last point. The average cost is not a good representative of the latency. Why not? latency in this case is the round trip time from each message perspective, throughput in a concurrent environment is not an indicator of latency because it assumes a uniformity that is just not there in real life. We know the behaviour for the first message is at best the same, so assuming first message latency is 230ns, how would the other messages timeline need to look to make the mean latency 20ns? is that reasonable? What is really happening here is as follows (my understanding in any case):
  • Producer pretty much writes all the messages into the queue in one go within a few nano-seconds of first and last message being written. So if we mark T1..T32 the send time of the messages, we can estimate T32 - T1 is within 100ns - 150ns (single threaded read/write to FF2 takes roughly 6ns, assume an even split between write and read and add some margin). Near empty queue contention can add to this cost, or any direct contention between poll/offer.
  • In fact, assuming a baseline cost of 6ns read/write, multiplied by chain length, leads to a base cost of 32*6*2=384ns. Following from this is that the minimum latency experienced by any message can never be less than 32*3=96ns as the burst must be sent before messages are received.  96ns is assuming zero cost for the link, this is how fast this would go offering and polling from the same queue in the same thread.
  • With any luck, and depending on the size of the burst, the consumer will see all the messages written by the time it detects data in the queue (i.e. it won't contend with the producer). There is opportunity here for reduced cost which is used by FF2/CAQ and BQ.
  • The consumer now reads from the queue, the first message is detected like it would for burst size 1, i.e after 115ns or so for FF2. This can only get worse with the contention of potentially other messages being written as the first message is read adding further latency. The rest of the messages are then read and there is again an opportunity to  lower costs. The messages are written into the outbound queue as soon as they are read. Reading and writing each message has a cost, which we can assume is at the very least as high as 6ns.
  • The producer now detects the messages coming back. Let's mark the return times for the messages R1-R32. We know T1 - R1 is at least 230ns and we know that T1 - R32 is roughly 640ns. Assuming some fixed costs for writing/reading from the queue, we can estimate that R32-R1 is also within 100-150ns.
  • So, working through the timeline as estimated above gives us the following best case scenario for message 32:
    • 3ns     - Source sends message 1
    • 96ns   -  Source sends message 32
    • 115ns - Link picks up message number 1, has to work through messages 1-31
    • 230ns - Source picks up message 1
    • 301ns - Link is done transferring 31 messages and is finally picking up message 32
    • 307ns - Link has sent message 32
    • 346ns - Source has finished processing message 32
  • In this hypothetical best case scenario, message 32 which had the best chance of improved latency enjoyed a 150ns RTT. This is the theoretical best case ever. IT DIDN'T HAPPEN. The RTT we saw for a 32 messages burst was 640ns, indicating some contention and an imperfect world. A world in which there's every chance message 32 didn't have a latency as good as message 1.
  • The fallacy of the average latency is that it assumes a uniform distribution of the send and receive times, but the reality is that the groups are probably lumped together on a timeline leading to many send times being within a few ns of each other, and many receive times being within a few ns of each other. This is the magic of buffering/queuing.
  • The latency experienced by the last element in a burst can be better than the latency experienced by the first, but not significantly so and subject to some very delicate timing. The important thing here is that good queues allow the producer to make progress as the consumer idles or is busy consuming previous items, thus allowing us to capitalize on concurrent processing.

To Be Continued...

There's allot of experimentation and data to be had here, so in the interest of making progress with the project as well as sharing the results I'll cut this analysis short and continue some other time. To be explored in further posts:
  • Compare sparse data impact on CAQ/FF2 for the bursty use case - I have done some experimentation which suggests FF2 gains little from the sparse data setting, while for CAQ the improvements are noticeable for certain burst sizes.
  • Compare average time results with the sampling mode results - There are significant differences in the results, which I'm still trying to explain. One would expect the sampling mode to have some fixed overhead of having to call System.nanoTime() twice for each sample, but the difference I see is an x2 increase in the mean.
  • Compare queue behaviour for different chain lengths - The results are unstable at the moment because the thread layout is not fixed. For chain length 2-4 the topography can be made static, but from chain size 5 and up the number of cross core crossings can change in mid run (on my hyper threaded quad core anyways), leading to great variance in results (which is legitimate, but annoying). I will need to pin threads and thread 'roles' to stabilize the benchmark for this use case, probably using JTA (Java Thread Affinity - by Sir Lawrey).
The overall result is pointing to FF2 as the clear winner of the fastest queue competition, but the variety is educational and different designs are useful for different things. This also strengthens my conviction that a requirement based factory for queues serves your code much better than naming the queue flavour you feel like as you code. All the more reason to invest more time into JAQ :-).

Thanks to Darach, Norman, and Peter for reviewing the post, and Aleksey for reviewing the benchmark. Peer review FTW!