Monday, 21 April 2014

Notes On Concurrent Ring Buffer Queue Mechanics

Having recently implemented an MPMC queue for JAQ (my concurrent queue collection) I have been bitten several times by the quirks of the underlying data structure and it's behaviour. This is an attempt at capturing some of the lessons learnt. I call these aspects or observations 'mechanics' because that's how they feel to me (you could call them Betty). So here goes...

What is a Ring Buffer?

A ring buffer is sometimes called a circular array or circular buffer (here's wiki) and is a pre-allocated finite chunk of memory that is accessed in a circular fashion to give the illusion of an infinite one. The N+1 element of a ring buffer of size N is the first element again:

So the first mechanical aspect to observe here is how the circularity is achieved. I'm using a modulo operator above, but you could use a bitwise AND instead if queue size was a power of 2 (i%(2^k) == i&((2^k) -1)). Importantly, circularity implies 2 different indexes can mean the same element.
Trivial stuff, I know, moving right along.

A Ring Buffer Based Bounded Queue: Wrap and Bounds

Ring buffers are used to implement all sorts of things, but in this particular case let us consider a bounded queue. In a bounded queue we are using the underlying buffer to achieve re-use, but we no longer pretend the capacity is infinite. We need to detect the queue being full/empty and track the next index to be offered/polled. For the non-thread safe case we can meet these requirements thus:

As an alternative we can use the data in the queue to detect the full/empty states:

The next mechanical aspect we can notice above is that for a queue to work we need to stop the producer from wrapping around and overwriting unread elements. Similarly we need the consumer to observe the queue is empty and stop consuming. The consumer is logically 'chasing' the producer, but due to the circular nature of the data structure the producer is also chasing the consumer. When they catch up to each other we hit the full/empty state.

The Single Producer/Consumer case: Visibility and Continuity

In the  SPSC (single producer/consumer threads, no more) case life remains remarkably similar to the non-thread safe case. Sure we can optimize memory layout and so on, but inherently the code above works with very minor changes. You will notice in the following code that I'm using a lovely hungarian notation of my own design, lets get the full notation out of the way:
  • lv - load volatile (LoadLoad barrier)
  • lp - load plain
  • sv - store volatile (LoadStore barrier)
  • so - store ordered (StoreStore barrier), like lazySet
  • sp - store plain
  • cas - compare and swap
lv/so for the counters could be implemented using an AtomicLong, an AtomicFieldUpdater or Unsafe. Here goes:
And the alternative (lvElement/soElement could be implemented using Unsafe or by replacing the original buffer with an AtomicReferenceArray):

All we needed to add are appropriate memory barriers to enforce visibility and ordering. The mechanical observation to be made regarding the barriers is that correct publication is a product of ordered visibility. This is very prominent in the counter based approach where the counter visibility derives data visibility in the queue. The second approach is slightly more subtle but the ordered write and volatile read guarantee correct publication of the element contents. This is an important property of concurrent queues that elements are not made visible to other threads before preceding writes to their contents.
I've added an optimization on the alternative offer that highlights a mechanical property of the SPSC and non-thread safe cases. The offer is probing ahead on the buffer, if the "tail + look ahead constant" element is null we can deduce that all the elements up to it are also null and write without checking them. This property of the queue is the continuity of data existence. We expect the ring buffer to be split to an all empty section and an all full section and we expect no 'holes' in either.

The MPSC case: Atomicity and Holes in the fabric of Existence

So now we have multiple threads hitting the queue. We need to ensure the blighters don't over-write the same slot and so we must increment head atomically before writing to the queue:

An interesting thing happens on the consumer side here. We can no longer rely on the tail counter for data visibility because the increment is done before the data is written. We must wait for the data to become visible and can't just assume it is there as we did before. This highlights the fact that tail is no longer indicative of producer progress.
The alternative poll method in this case cuts straight to the issue:

I will not show the code for the SPMC case, it is very similar, but one point is worth examining. For the SPMC the offer method can no longer employ the probe ahead optimization shown above. That is because the continuity property is no longer true (a real shame, I liked it allot). Consider 2 consumer threads where one has progressed the danger point where head will be visible but not the data, and the other charging ahead of it. The slot is not null and remains so until the thread resumes. This means the empty section of the queue now has a hole (not null element) in it... making the probe ahead optimization void. If we were to keep the optimization the producer would assume the coast is clear and may overwrite an element in the queue before it is consumed.
For both MPSC/SPMC and also MPMC we can therefore observe that counter increment atomicity does not imply queue write atomicity. We can also see that this scheme has no fairness of counter acquisition or slot use so it is possible to have many producers/consumers stuck while others make progress. For example, given 3 producers A, B and C we can have the queue fill up such that the slots are claimed to the effect of: [ABBBBBBBCAAAAACCCABABACCCC...] or any such random layout based on the whims of the scheduler and CAS contention.

The MPMC case: What Goes Around

So finally all hell breaks loose and you have multiple producers and consumers all going ape pulling and pushing. What can we do? I ended up going with the solution put forward by Mr. D. Vyukov after I implemented a few flawed variations myself (an amusing story to be told some other time). His solution is in C and benefits from the memory layout afforded by languages with struct support. I had to mutate the algorithm (any bugs are my own) to use 2 arrays instead of one struct array but otherwise the algorithm is the very similar:

So... what just happened?
  • We're setting up each slot with a sequence
  • A producer can only write to the slot if the sequence matches tail and they won the CAS
  • A consumer can only read the slot if the sequence is head + 1 and they won the CAS
  • Once a producer writes a slot he sets it's sequence to tail + 1
  • Once a consumer reads a slot she sets the sequence to head + buffer.length
Why can't we rely on the head/tail anymore? well... the head/tail values were half useless before as pointed out in the MPSC section because they reflect the most advance consumer/producer and cannot indicate data state in the queue.
Can't we use the null/not-null check like before? mmm... this one is a bugger. The surprising problem here is producers catching up with other producers after wrapping and consumers doing the same to other consumers. Imagine a short queue, 2 slots only, and 2 producer threads. Thread one wins the CAS and stalls before writing slot 1, thread 2 fills second slot and comes round to hit slot 1 again, wins the CAS and either writes over thread 1 or gets written over by thread 1. They can both see the slot as empty when they get there.
A solution relying on counters exists such that it employs a second CAS on the data, but:

  1. It is slower, which is to be expected when you use 2 CAS instead of one
  2. It runs the risk of threads getting stuck to be freed only when the other threads come full circle again. Think of a producer hitting another producer on the wrap as discussed before and then one wins the CAS on data and the other is left to spin until the slot is null again. This should be extremely rare (very hard to produce in testing, possible by debugging to the right points), but is not a risk I am comfortable with.

I'm hoping to give concrete examples broken code in a further post, but for now you can imagine or dig through the commit history of JAQ for some examples.
The sequence array is doubling our memory requirement (tripling it for 32bit/compressed oops). We might be able to get by with an int array instead. The solution works great in terms of performance, but that is another story (expect followup post).
The important observation here on the mechanical side is that for MPMC both head and tail values are no longer reliable means of detecting wrap and as such we have to detect wrap by means other than head/tail counters and data existence.

Summary

  • Circular/ring array/buffer give the illusion of infinite arrays but are actually finite.
  • Bounded queues built on ring buffers must detect queue full/empty states and track head/tail positions.
  • Ring buffers exhibit continuity of existence for the full/empty sections ONLY in the SPSC or single threaded case.
  • MPSC/SPMC/MPMC queues lose continuity, can have holes.
  • Counter increment atomicity does not imply write atomicity.
  • MP means tail is no longer a reliable means of ensuring next poll is possible.
  • MC means head is no longer a reliable mean of ensuring next offer is possible.
  • MPMC implementations must contend with producer/producer and consumer/consumer collisions on wrap.
I'm publishing this post in a bit of a rush, so please comment on any inaccuracies/issues/uncertainties and I shall do my best to patch/explain if needed. Many thanks go out to Martin Thompson, Georges Gomez, Peter Hughes and anybody else who's bored of hearing my rambles on concurrent queues.

Friday, 28 March 2014

Java Object Layout: A Tale Of Confusion

Buy the T-shirt
Following the twists and turns of the conversation on this thread in the Mechanical Sympathy mailing list highlights how hard it is to reason about object layout based on remembered rules. Mostly every person on the thread is right, but not completely right. Here's how it went...

Context

The thread discusses False Sharing as described here. It is pointed out that the padding is one sided and padding using inheritance is demonstrated as solution. The merits of using inheritance vs. using an array and utilizing Unsafe to access middle element (see Disruptor's Sequence) vs. using AtomicLongArray to achieve the same effect are discussed (I think the inheritance option is best, as explored here). And then confusion erupts...

What's my layout?

At this point Peter L. makes the following point:
[...]in fact the best option may be.
class Client1 {
    private long value;
    public long[] padding = new long[5];
}
What follows is a series of suggestions on what the layout of this class may be.

Option 1: My recollections

I was too lazy to check and from memory penned the following:
[...] the ordering of the fields may end up shifting the reference (if it's 32bit or CompressedOop) next to the header to fill the pad required for the value field. The end layout may well be:
12b header
4b padding(oop)
8b value

Option 2: Simon's recollections

Simon B. replied:
[...] I thought Hotspot was laying out longs before
references, and that the object header was 8 Bytes.
So I would expect Client1 to be laid out in this way:
8B header
8B value
4B padding (oop)
[...] Am I out of date in object layout and header size ?

Option 3: TM Jee's doubts

Mr. Jee slightly changed the class and continued:
for:
class Client1 {
    private long value;
    public long[] padding = new long[5]
    public Object[] o = new Object[1];
}
the memory layout should be something like
12b header (or is it 16b)
8b value
4b for the long[] (its just the reference which is 4b for compressed and 8b if not)
4b for the Object[] (again it's just the reference)
Is this right so far?
To which  Peter L. wisely replied:
Yes. But as you recognise the sizes of the header and sizes of references are not known until runtime.

Option 4: Check...

So I used JOL to check. And as it turns out we are all somewhat right and somewhat wrong...
I'm right for compressed oops (the default for 64bit):
Running 64-bit HotSpot VM.
Using compressed references with 3-bit shift.
Client1 object internals:
 OFFSET  SIZE     TYPE DESCRIPTION
      0     4          (object header)
      4     4          (object header)
      8     4          (object header)
     12     4   long[] Client1.padding
     16     8     long Client1.value
     24     4 Object[] Client1.o
     28     4          (loss due to the next object alignment)

The header is 12b and the array reference is shifted up to save on space. But my casual assumption 32bit JVM layout will be the same is wrong.

Simon is right that the header is 8b (but only for 32bit JVMs) and that references will go at the end (for both 32bit and 64bit, but not with compressed oops):
Running 32-bit HotSpot VM.
Client1 object internals:
 OFFSET  SIZE     TYPE DESCRIPTION
      0     4          (object header)
      4     4          (object header)
      8     8     long Client1.value
     16     4   long[] Client1.padding
     20     4 Object[] Client1.o

And finally with 64bit Mr. Jee is right too:
Running 64-bit HotSpot VM.
Client1 object internals:
 OFFSET  SIZE     TYPE DESCRIPTION
      0     4          (object header)
      4     4          (object header)
      8     4          (object header)
     12     4          (object header)
     16     8     long Client1.value
     24     8   long[] Client1.padding
     32     8 Object[] Client1.o

And Peter is entirely right to point out the runtime is the crucial variable in this equation.

Lesson?

If you catch yourself wondering about object layout:
  1. Use JOL to check, it's better than memorizing rules
  2. Remember that 32/64/64+Oops are different for Hotspot, and other JVMs may have different layouts altogether
  3. Read another post about java memory layout

Wednesday, 26 March 2014

Where is my safepoint?

My new job (at Azul Systems) leads me to look at JIT compiler generated assembly quite a bit. I enjoy it despite, or perhaps because, the amount of time I spend scratching my increasingly balding cranium in search of meaning. On one of these exploratory rummages I found a nicely annotated line in the Zing (the Azul JVM) generated assembly:
gs:cmp4i [0x40 tls._please_self_suspend],0jnz 0x500a0186
Zing is such a lady of a JVM, always minding her Ps and Qs! But why is self suspending a good thing?

Safepoints and Checkpoints

There are a few posts out there on what is a safepoint (here's a nice one going into when it happens, and here is a long quote from Mechnical Sympthy mailing list on the topic). Here's the HotSpot glossary entry:
safepoint
A point during program execution at which all GC roots are known and all heap object contents are consistent. From a global point of view, all threads must block at a safepoint before the GC can run. (As a special case, threads running JNI code can continue to run, because they use only handles. During a safepoint they must block instead of loading the contents of the handle.) From a local point of view, a safepoint is a distinguished point in a block of code where the executing thread may block for the GC. Most call sites qualify as safepoints. There are strong invariants which hold true at every safepoint, which may be disregarded at non-safepoints. 
To summarize, a safepoint is a known state of the JVM. Many operations the JVM needs to do happen only at safepoints. The OpenJDK safepoints are global, while Zing has a thread level safepoint called a checkpoint. The thing about them is that at a safepoint/checkpoint your code must volunteer to be suspended to allow the JVM to capitalize on this known state.
What will happen while you get suspended varies. Objects may move in memory, classes may get unloaded, code will be optimized or deoptimized, biased locks will unbias.... or maybe your JVM will just chill for a bit and catch its breath. At some point you'll get your CPU back and get on with whatever you were doing.
This will not happen often, but it can happen which is why the JVM makes sure you are never too far from a safepoint  and voluntary suspension. The above instruction from Zing's generated assembly of my code is simply that check. This is called safepoint polling.
The safepoint polling mechanism for Zing is comparing a thread local flag with 0. The comparison is harmless as long as the checkpoint flag is 0, but if the flag is set to 1 it will trigger a checkpoint call (the JNZ following the CMP4i will take us there) for the particular thread. This is key to Zing's pause-less GC algorithm as application threads are allowed to operate independently.

Reader Safpoint

Having happily grokked all of the above I went looking for the OpenJDK safepoint.

Oracle/OpenJDK Safepoints

I was hoping for something equally polite in the assembly output from Oracle, but no such luck. Beautifully annotated though the Oracle assembly output is when it comes to your code, it maintains some opaqueness when it's internals are concerned. After some digging I found this:
test   DWORD PTR [rip+0xa2b0966],eax        # 0x00007fd7f7327000
                                                ;   {poll}
No 'please', but still a safepoint poll. The OpenJDK mechanism for safepoint polling is by accessing a page that is protected when requiring suspension at a safepoint, and unprotected otherwise. Accessing a
protected page will cause a SEGV (think exception) which the JVM will handle (nice explanation here). To quote from the excellent Alexey Ragozin blog:
Safepoint status check itself is implemented in very cunning way. Normal memory variable check would require expensive memory barriers. Though, safepoint check is implemented as memory reads a barrier. Then safepoint is required, JVM unmaps page with that address provoking page fault on application thread (which is handled by JVM’s handler). This way, HotSpot maintains its JITed code CPU pipeline friendly, yet ensures correct memory semantic (page unmap is forcing memory barrier to processing cores).
The [rip+0xa2b0966] addressing is a way to save on space when storing the page address in the assembly code. The address commented on the right is the actual page address, and is equal to the rip (Relative Instruction Pointer) + given constant. This saves space as the constant is much smaller than the full address representation. I thank Mr. Tene for clarifying that one up for me.
If we were to look at safepoints throughout the assembly of the same process they would all follow the above pattern of pointing at the same global magic address (via this local relative trick). Setting the magic page to protected will trigger the SEGV for ALL threads. Note that the Time To Safe Point (TTSP) is not reported as GC time and may prove a hidden performance killer for your application. The effective cost of this global safepoint approach goes up the more runnable (and scheduled) threads your application has (all threads must wait for a safepoint consensus before the operation to be carried out at the safepoint can start).


Find The Safpoint Summary

In short, when looking for safepoints in Oracle/OpenJDK assembly search for poll. When looking at Zing assembly search for _please_self_suspend.



Tuesday, 25 February 2014

Unsafe Pointer Chasing: Running With Scissors

Love running? Love scissors? I know just the thing for you! Following on from recent discussion on the Mechanical Sympathy mailing list I see an anti pattern worth correcting in the way people use Unsafe. I say correcting as I doubt people are going to stop, so they might as well be made aware of the pitfalls. This pattern boils down to a classic concurrency bug:

Q: "But... I not be doing no concurrency or nuffin' guv"
A: Using Unsafe to gain a view of on-heap addresses is concurrent access by definition.

Unsafe address: What is it good for?

Absolutely nothing! sayitagain-huh! I exaggerate, if it was good for nothing it would not be there, let's look at the friggin manual:
As we can see the behaviour is only defined if we use the methods together, and by that I mean that get/putAddress are only useful when used with an address that is within a block of memory allocated by allocateMemory. Now undefined is an important word here. It means it might work some of the time... or it might not... or it might crash your VM. Let's think about this.

Q: What type of addresses are produced by allocateMemory?
A: Off-Heap memory addresses -> unmanaged memory, not touched by GC or any other JVM processes

The off-heap addresses are stable from the VM point of view. It has no intention of running around changing them, once allocated they are all yours to manage and if you cut your fingers in the process or not is completely in your control, this is why the behaviour is defined. On-Heap addresses on the other hand are a different story.

Playing With Fire: Converting An Object Ref to An Address

So imagine you just had to know the actual memory address of a given instance... perhaps you just can't resist a good dig under the hood, or maybe you are concerned about memory layout... Here's how you'd go about it:
Now... you'll notice the object ref needs a bit of cuddling to turn into an address. Did I come up with such devilishly clever code myself? No... I will divulge a pro-tip here:
If you are going to scratch around the underbelly of the JVM, learn from as close to the JVM as you can -> from the JDK classes, or failing that, from an OpenJDK project like JOL (another Shipilev production)
In fact, the above code could be re-written to:
Now that we have the address what can we do with it? Could we use it to copy the object? maybe we could read or modify the object state? NO! we can but admire it's numerical beauty and muse on the temperamental values waiting at the other end of that address. The value at the other end of this address may have already been moved by GC...

Key Point: On-Heap Addresses Are NOT Stable

Consider the fact that at any time your code may be paused and the whole heap can be moved around... any address value you had which pointed to the heap is now pointing to a location holding data which may be trashed/outdated/wrong and using that data will lead to a funky result indeed. Also consider that this applies to class metadata or any other internal accounting managed by the JVM.
If you are keen to use Unsafe in the heap, use object references, not addresses. I would urge you not to mix the 2 together (i.e. have object references to off-heap memory) as that can easily lead to a very confused GC trying to chase references into the unknown and crashing your VM.

Case Study: SizeOf an Object (Don't do this)

This dazzling fit of hackery cropped up first (to my knowledge) here on the HighScalability blog:
This is some sweet macheta swinging action :-). The dude who wrote this is not suggesting it is safe, and only claims it is correct on a 32bit VM. And indeed, it can work and passes cursory examination. The author also states correctly that this will not work for arrays and that with some corrections this can be made to work for 64 bit JVMs as well. I'm not going to try and fix it for 64 bit JVMs, though most of the work is already done in the JOL code above. The one flaw in this code that cannot be reliably fixed is that it relies on the native Klass address (line 6) to remain valid long enough for it to chase the pointer through to read the layout helper (line 8). Spot the similarity to the volatile bug above?
This same post demonstrates how to forge references from on-heap objects to off-heap 'objects' which in effect let you cast a pointer to a native reference to an object. It goes on to state that is a BAD IDEA, and indeed it can easily crash your VM when GC comes a knocking (but it might not, I didn't try).

Case Study: Shallow Off-Heap Object Copy (Don't do this)

Consider the following method of making an off-heap copy of an object (from here, Mishadof's blog):
We see the above is using the exact same method for computing size as demonstrated above. It's getting the on-heap object address (limited correctness, see addresses discussion above) than copying the object off-heap and reading it back as a new object copy... Calling the Unsafe.copyMemory(srcAddress, destAddress, length) is inviting the same concurrency bug discussed above. A similar method is demonstrated in the HighScalability post, but  there the copy method used is Unsafe.copyMemory(srcRef, srcOffset, destRef, destOffset, length). This is important as the reference using method is not exposed to the same concurrency issue.
Both are playing with fire ofcourse by converting off-heap memory to objects. Imagine this scenario:
  • a copy of object A is made which refers to another object B, the copy is presented as object C
  • object A is de-referenced leading to A and B being collected in the next GC cycle
  • object C is still storing a stale reference to B which is no managed by the VM
What will happen if we read that stale reference? I've seen the VM crash in similar cases, but it might just give you back some garbage values, or let you silently corrupt some other instance state... oh, the fun you will have chasing that bugger down...

Apologies

I don't mean to present either of the above post authors as fools, they are certainly clever and have presented interested findings for their readers to contemplate without pretending their readers should run along and build on their samples. I have personally commented on some of the code on Mishadof's post and admit my comments were incomplete in identifying the issues discussed above. If anything I aim to highlight that this hidden concurrency aspect can catch out even the clever.
Finally, I would be a hypocrite if I told people not to use Unsafe, I end up using it myself for all sorts of things. But as Mr. Maker keeps telling us "Be careful, because scissors are sharp!"

Thursday, 13 February 2014

When I say final, I mean FINAL!

Having recently bitched about the lack of treatment of final field as final I was urged by Mr. Shipilev to demonstrate the issue in a more structured way (as opposed to a drunken slurred rant), I have now recovered my senses to do just that. The benchmark being run and the queue being discussed are covered in this post, so please refresh you memory for context if you need. The point is clear enough without full understanding of the context though.
It is perhaps a fact well known to those who know it well that final fields, while providing memory visibility guarantees, are not actually immutable. One can always use reflection, or Unsafe, to store new values into those fields, and in fact many people do (and Cliff Click hates them and wishes them many nasty things). This is (I believe) the reason behind some seemingly trivial optimizations not being done by the JIT compiler.

Code Under Test: FFBufferWithOfferBatch.poll()

The buffer field is a final field of FFBufferWithOfferBatch and is being accessed twice in the method above. A trivial optimization on the JIT compiler side would be to load it once into a register and reuse the value. It is 'immutable' after all. But if we look at the generate assembly (here's how to, I also took the opportunity to try out JITWatch which is brilliant):
We can see buffer is getting loaded twice (line 15, and again at line 24). Why doesn't JIT do the optimization? I'm not sure... it may be due to the volatile load forcing a load order that could in theory require the 'new' value in buffer to be made visible... I don't know.

Hack around it, see if it makes a difference

Is that a big deal? Let's find out. The fix is trivial:
And the assembly code generated demonstrates the right behaviour now (one load at line 15):
Now, was that so hard to do? And more importantly, does it make any difference to performance? As discussed previously, the throughput benchmark is sensitive to changes in the cost balance between offer/poll. The optimization creates an interesting change in the pattern of the results:
The benchmark is run on Ubuntu13.10/JDK7u45/i7@2.4, the x axis is the index of the benchmark run and the Y axis is the result in ops/sec. The chart displays the results for before the change (B-*) and after(A-*) with different sparse data settings. We can see the change has accelerated the consumer, leading to increased benefit from sparse data that was not visible before. With sparse data set to 1 the optimization results in a 2% increase in performance. Not mind blowing, but still. The same change applied to the producer thread loop (localizing the reference to the queue field) discussed in the previous post enabled a 10% difference in performance as the field reference stopped the loop from unrolling and was read on each iteration. I used the poll() example here because it involves allot less assembly code to wade through.

Hopefully this illustrates the issue to Mr. Shipilev's content. Thanks goes to Gil Tene for pointing out the optimization to me and to Chris Newland for JITWatch.

Tuesday, 28 January 2014

Picking the 2013 SPSC queue champion: benchmarking woes

I put FFBufferWithOfferBatch together as part of looking into the performance of BQueue and what with all the excitement about JAQ I never introduced it properly. As it's the best performing SPSC(Single Producer Single Consumer) queue I have thus far I thought I should quickly cover it and tell you some horrific benchmarking tales of how I determined it's the best thus far.

Offer like a BQueue

The BQueue was previously presented and discussed here. Offer logic is simple and improves on the original FFBuffer by ensuring clearance to write ahead for the next OFFER_BATCH_SIZE elements. This improvement has 3 effects:
  1. Most of the time we can write without reading and checking for null. We do one volatile read for every OFFER_BATCH_SIZE elements, the rest of the time we compare to the cached field which is a normal field that can be held in a register.
  2. We avoid hitting the full queue induced contention by keeping a certain distance from the head of the queue. Contention can still happen, but it should be quite rare.
  3. We potentially miss out on part of the queue capacity as we will declare the queue full if there are less than OFFER_BATCH_SIZE available when we hit the cached value.https://twitter.com/peterhughesdev
Overall this is not a bad deal.

Poll like an FFBuffer

This poll method is again quite simple. Offset hides the sparse data shift trick that is discussed in depth here. The sparse data trick is a way of reducing contention when nearing the empty queue state. With a sparse shift of 2 only 4 elements in the queue buffer actually share the same cache line (as opposed to 16 when data is not sparse) reducing the frequency with which that state occurs. This has a side effect of hurting memory throughput as each cache line we load has less data in it.

Benchmarks: A Misery Wrapped In An Enema

I've been benchmarking these queues over a long period of time and it is a learning experience (i.e: very painful). Benchmarking is just generally a complex matter, but when it comes to examining and comparing code where a single operation (offer/poll) is in the nano-seconds I quote Mr. Shipilev: "nano-benchmarks are the damned beasts". Now, consider a multi-threaded nano-benchmark...
Let's start with naming the contestants:
Now, the first benchmark I used to quantify the performance of these queues was the one I got from Martin's examples project. I stuck with it for consistency (some cosmetic changes perhaps), though I had some reservations about the section under measurement. Here's how the measured section goes:

The benchmark measures the time it takes for REPETITIONS number of 'messages' to be sent from the producer to the consumer. If the consumer find that the queue is empty it will Thread.yield and try again, similarly if the producer finds that the queue is full it will Thread.yield and try again.
Here's how the queues perform with that benchmark (running on Ubuntu13.10/JDK7u45/i7-4300MQ@2.4 no turbo boost, pinned across cores using taskset, with JVM options: '-server -XX:+UseCondCardMark -XX:CompileThreshold=100000'):
  CAQ(sparse=0) - 130M ops/sec
  FF1(sparse=0) - 155M ops/sec
  FF2(sparse=0) - 185M ops/sec
  CAQ(sparse=2) - 288M ops/sec
  FF1(sparse=2) - 282M ops/sec
  FF2(sparse=2) - 290M ops/sec

When I started on JAQ I took the same benchmark, but changed a few things. I moved the start timestamp to the producing thread and moved the thread join out of the measured section, I also added some counters for when the producer fails to offer (queue is full) and the consumer fails to poll (queue is empty). Finally I took this same benchmark and ported it to use my very own ConcurrentQueue interface. Here's how it looked:

You would think this sort of change could only make performance slightly worse, if it made any difference at all. Certainly that's what I thought, but I was wrong... With the new benchmark I got the following results:
  FF2(sparse=0) - 345M ops/sec
  FF2(sparse=2) - 327M ops/sec


WTF? What is the difference?

I looked long and hard at the benchmark and realized the only difference, that shouldn't really make a difference, but the only difference that does make a difference is the localization of the queue field to the producer loop (found by process of elimination). Tweaking the original benchmark to localize the queue reference gives us different results for all queues:
  CAQ(sparse=0) - 291M ops/sec
  FF1(sparse=0) - 190M ops/sec
  FF2(sparse=0) - 348M ops/sec
  CAQ(sparse=2) - 303M ops/sec
  FF1(sparse=2) - 287M ops/sec
  FF2(sparse=2) - 330M ops/sec

So we notice things have not improved much for FF1, but for CAQ we now have only a marginal difference between using sparse data and not using it, and for FF2 it is actually better not to bother at all with sparse data. The localization of the queue reference made the producer faster, reducing the number of times the empty queue state is hit and thus reducing the need for sparse data. We can try to validate this claim by running the Queue variation of this benchmark with the counters. With the reference localized:
  CAQ(sparse=0) - 291M ops/sec, poll fails 350,    offer fails 0
  FF1(sparse=0) - 192M ops/sec, poll fails 32000,  offer fails 0
  FF2(sparse=0) - 348M ops/sec, poll fails 150,    offer fails 13000
  CAQ(sparse=2) - 303M ops/sec, poll fails 270,    offer fails 0
  FF1(sparse=2) - 287M ops/sec, poll fails 200,    offer fails 0
  FF2(sparse=2) - 330M ops/sec, poll fails 170,    offer fails 10

So we can see that adding the extra counters made little difference with the reference localized, but when referencing the field directly in the producer loop:
  CAQ(sparse=0) - 167M ops/sec, poll fails 2400,   offer fails 0
  FF1(sparse=0) - 160M ops/sec, poll fails 100000, offer fails 0
  FF2(sparse=0) - 220M ops/sec, poll fails 2000,   offer fails 0
  CAQ(sparse=2) - 164M ops/sec, poll fails 31000,  offer fails 0
  FF1(sparse=2) - 250M ops/sec, poll fails 2000,   offer fails 0
  FF2(sparse=2) - 255M ops/sec, poll fails 500,    offer fails 0

We get a different picture again, in particular for CAQ. To get to the bottom of the exact effect localizing the reference had on this benchmark I will have to dig into the generated assembly. This will have to wait for another post...


Conclusions And Confusions

The overall winner, with all the different variations on the throughput benchmarks, is the FFBufferWithOfferBatch. It's also the winner of the previously presented/discussed latency benchmarks(part 1, part 2). With turbo boost on it hits a remarkable high of 470M ops/sec. But setting this result to one side, the above results highlight a flaw in the throughput benchmark that is worth considering in the context of other concurrent benchmarks, namely that changing the [queue] implementation under test can change the benchmark.
Let me elaborate on this last point. What I read into the results above is that the benchmark was initially a benchmark where the queue full state was never hit, and the queue empty state was hit more or less depending on the queue implementation. Given FF2 is only a change to the offer method of FF1 we can see how tilting the balance between the offer cost and poll cost changed the nature of the test. In particular when using no sparse data the producer turned out to be significantly faster than the consumer. But... in changing the implementation we have unbalanced the benchmark, it is hardly ever hitting an empty queue, which would slow down the consumer/producer threads. We have switched to measuring the full speed of consumption as a bottleneck, but only for one queue implementation. So this is, in a way, not the same benchmark for all queues.
While this benchmark is still useful to my mind, it is worth considering that it is in fact benchmarking different use cases for different queues and that the behaviour of the using code will inevitably balance the producer/consumer costs quite differently. All in all, yet another case of your mileage may vary ;-)

Thanks to Peter, Darach and Mr. Gomez

Thursday, 16 January 2014

JAQ: Using JMH to Benchmark SPSC Queues Latency - Part II

Just quickly sharing more results/insights from running different configuration of the SPSC latency benchmark discussed in the previous post. The previous post reviewed the different implementations behaviour when sending different sizes of bursts, in this post I'll have a look at the impact of the length of the chain (number of threads passing the messages from one to the other returning to the originator) on latency. The benchmarks and queues stay the same, so you may have to skip back to the previous post for context.

TL;DR

This post may not be of great popular appeal... it's a verification of the behaviour of current implementations of SPSC in terms of single threaded operation costs and RTT latency for different 'trip' lengths. The main finding of interest (for me) is that the sparse data method previously discussed here is proving to be a hindrance to latency performance. This finding is discussed briefly at the end and is something I may dig into more deeply in the near future. Other more minor nuggets were found... but anyhow, lets get to it. 

Chain length 1: The cost of talking to yourself

When I had Darach review the original post he pointed out to me that I neglected to cover this edge case in which a queue is being used by a single thread acting as both the consumer and the producer. The result is interesting as a baseline of operation costs when no concurrency is involved. The benchmark is very simple:


The results give us a sense of the cost of [(offer * burst size) + (poll * burst size)] for the different implementations. I added one of my favourite collections, the ArrayDequeue, to compare the efficiency of a non-thread safe queue implementation with my furry gremlins. Here's the score for AD vs CLQ for the different burst sizes:

This is a bit of a bland presentation, so feel free to call me names (you bland representor of numbers!). The numbers are here if you want to play. All the numbers are in nanos per op, mean error in brackets. Benchmark run on laptop(Ubuntu13.10/JDK7u45/i7-4700MQ@2.4 capped). Here's the score for CAQ/FF2 with sparse shift set to 0 and 2:


Points for consideration:
  • CLQ (ConcurrentLinkedQueue) is the worst. This is completely the wrong tool for the job which is why this result should be a lesson in why using concurrent data structures where you don't need them will suck.
  • Overall AD(ArrayDequeue) is the best from burst size 4 and up. This is the right tool for the job when your consumer and producer are the same thread. Choose the right tool, get the right result.
  • CAQ(ConcurrentArrayQueue) and FF2(FFBufferWithOfferBatch) start off as similar or better than AD, but soon fall behind. Interestingly FF2 starts as the winner of the 3 and ends as the loser with CAQ overtaking it from burst size 64 and onward.

I found it interesting to see how the behaviour changed as the burst size changes, and I thought the single threaded cost estimates were valuable as enablers of other estimates, but overall this is not a benchmark to lose sleep over.
Note the CAQ/FF2 are measured in 2 configurations with and without sparse data. The use of sparse data is counter productive in this use case as it aims to reduce contention that is not there, and inhibits throughput in the process.

Stable Chains: Working across the cores

Due to the limits of my benchmarking machine I can only demonstrate stable behaviour for chains of length 1 to 4 without resorting to pinning threads to cores. Still, the results offer some insight. The benchmark in use is the one discussed previously here. The whole process is pinned such that the only available inter-thread channel is across cores. On my machine I have a hyper-threaded quad-core CPU so 8 logical cores from a taskset perspective. The even/odd numbers are on different cores. For my runs I pinned all other processes to core 0 and pinned the JMH process to different cores (as many as I needed from 1,3,5,7). This is not ideal, but is workable.
I didn't bother with CLQ for this set of results. Just CAQ and FF2. The benchmark code is here.
Here's chain length 2 (T1 -> T2 -> T1):


Here's chain length 3 (T1 -> T2 -> T3 -> T1):

Here's chain length 4 (T1 -> T2 -> T3 -> T4 -> T1):


This is data collected from (10 warmup iterations + 10 iterations) * 10 forks, so benchmarking took a while. This should mean the data is slightly more than anecdotal...
Conclusions:

  • FF2 is consistent proving to be the best queue in terms of inter-thread latency. This advantage is maintained for all burst sizes and chain lengths.
  • Sparse data is only marginally improving results consistently for burst size 1. After that it seems to have a small positive effect on CAQ with chain length 3/4 with small bursts. This may be a quirk of the use case demonstrated by the benchmark but it at least shows this method has a demonstrable down side.
  • The burst RTT does not grow linearly with the burst size, in particular for the smaller bursts. This is due to the bursts filling up the time-to-notify period with write activity.
Here's the latency for FF2(sparse=0) across all chains:

  • Single threaded cost grows linearly with size, this is how we are used to think about cost. It's also a lower boundary for the round trip use case as the round trip requires at least the amount of work done in the single thread case to still be done in the measuring/source thread.
  • Once we increase the chain length we get this initial plateau steadily increasing in slope to become linear again. Note that for the 2 threaded case the costs are nearly the same as the single thread case from burst size 512.

Where to next?

I've had a long break over Xmas, some progress was made on JAQ, but not as much as I'd have liked... I actually spent most of the time away from a keyboard. There are now MPSC/SPMC implementations, and a direct ConcurrentQueue implementation for SPSC hooked into the factory and an MPSC one nearing completion. I've had some interest from different people and am working towards meeting their requirements/needs... we'll get there eventually :-)

Thanks Norman M. for his review (he did tell me to add graphs... my bad)