Monday 15 July 2013

Single Producer Single Consumer Queue Revisited: An empiricist tale - part I


{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
Applying lessons learnt about memory layout to prev. discussed SPSC (Single Producer Single Consumer) queue. Tackling and observing some manifestations of false sharing. Getting bitch slapped by reality.

In preparations to implement an MPMC (Many Producers Many Consumers) queue I went back to Martin Thompson's SPSC queue I dissected in detail in this blog post. I was going to use it as the basis for the new queue with a few changes and discuss the transition. In particular an opportunity offered in the implementation of said queue with the addition of the new getAndAdd intrinsic to JDK 8. I was going to... but then I thought, 'let me try a couple more things!' :-).

Where were we?

In case you can't be bothered to go back and read the whole thing again, here's a quick summary. Mr. Thompson open sourced a few samples he discusses in his presentation on lock-free algorithms, in particular a SPSC queue developed across a few stages. I broke down the stages further and benchmarked before and after to explore the effect of each optimisation as it is applied. I then ported the same queue into an off-heap implementation and used it to demonstrate a fast IPC mechanism capable of sending 135M messages per second between processes. The queue demonstrated the following techniques:
Snowman contemplating
evolution
  • lock free, single writer principle observed.
  • Set capacity to power of 2, use mask instead of modulo. 
  • Use lazySet instead of volatile write.
  • Minimize volatile reads by adding local cache fields.
  • Pad all the hot fields: head, tail, headCache,tailCache to avoid false sharing.
So... what's to improve on? There were a few niggles I had looking at this code again, some I've mentioned before. The last time I benchmarked the original queue implementation I noticed high run to run variance in the results. This was particularly prominent when running across 2 cores on the same socket or across sockets.
To expose the variance I modified the test to produce a summary line (each test runs 20 iterations, the summary is the average of the last 10 test iterations) and ran it 30 times. The results demonstrate the variance (results are sorted to highlight the range, X-axis is the run index, Y-axis is a summary of the run, SC means producer and consumer run on the same core, CC means they run across cores on the same socket):


Original queue performance
OUCH! We get half the performance 20% of the time. The results were very stable within a given run, leading me to believe there was a genuine issue at play.

So I thought, let's poke the i's and kick the t's, see if we can shake the variance out of the bugger.

Terms & Conditions

Benchmarks were carried out on a dual socket Xeon running CentOS 6 and Oracle JDK 1.7u17. Affinity was set using taskset, the scripts used are included with the code as well as the raw results and the LibreOffice spread sheets used for the graphs. Furry, fluffy animals were not harmed and people of different religions and races were treated with the outmost respect and afforded equal opportunity. I ran the cross socket tests, and the data is included, but I chose not to discuss them as no improvement was made on that front and they would make this already long post longer. 


Flatten me counters!

To start off, I was not thrilled with the way the counter padding was implemented, for 2 reasons:
  1. By using container classes for the counters we introduce indirection, we could optimise by inlining the data into the queue structure.
  2. The Padded* classes only pad to one side, we are counting on the instances to be laid out together because they are instantiated together. In the right environment I'm pretty sure this can go wrong. By go wrong I mean the instances might get allocated/placed next to data modified elsewhere leading to false sharing. By inlining the counters and forcing strict order we could kill 2 bird with one stone.
To inline the counters, and provide the padding required to provide false-sharing protection I used inheritance to force layout (as outlined previously here). I used Unsafe to get field offset and implement lazySet directly into the fields inlined in my class (this is replacing the original PaddedLong/PaddedAtomicLong, the same method is used in AtomicLong to implement lazySet) the code:

It ain't pretty, but it does the job. The layout can be verified using the excellent java-object-layout tool (I format the output for brevity):
psy.lob.saw.queues.spsc1.SPSCQueue1
 offset  size     type description
      0    12          (assumed to be the object header + first field alignment)
     12     4      int ColdFields.capacity
     16     4      int ColdFields.mask
     20     4 Object[] ColdFields.buffer
     24-72  8     long L1Pad.p10-16
     80     8     long TailField.tail
     88-136 8     long L2Pad.p20-p26
    144     8     long HeadCache.headCache
    152-200 8     long L3Pad.p30-36
    208     8     long HeadField.head
    216-264 8     long L4Pad.p40-46
    272     8     long TailCache.tailCache
    280-328 8     long L5Pad.p50-56
    336                (object boundary, size estimate)

Wicked! Run the same tests above to see what happened:
Original vs. Inlined counters(I1)

We get a small improvement when running on the same core (average 3% improvement), but the cross core behaviour is actually worse. Bummer, keep kicking. 

Padding the class fields

If we look at the above memory layout, we'll notice the fields capacity, mask and buffer are left flush against the object header. This means that they are open to false sharing with other objects/data allocated on the same cache line. We can add a further love handle on that big boy to cover that front:

Note that by happy coincidence we have already padded the tail side of the fields as part of our flattening exercise. So now the layout is:
psy.lob.saw.queues.spsc3.SPSCQueue3
 offset  size     type description
      0    12          (assumed to be the object header + first field alignment)
     12     4          (alignment/padding gap)
     16-72  8     long L0Pad.p00-07
     80     4      int ColdFields.capacity
     84     4      int ColdFields.mask
     88     4 Object[] ColdFields.buffer
     92     4          (alignment/padding gap)
     96-144 8     long L1Pad.p10-16
    152     8     long TailField.tail
    160-208 8     long L2Pad.p20-26
    216     8     long HeadCache.headCache
    224-272 8     long L3Pad.p30-36
    280     8     long HeadField.head
    288-336 8     long L4Pad.p40-46
    344     8     long TailCache.tailCache
    352-400 8     long L5Pad.p50-56
    408                (object boundary, size estimate)

Try again and we get:
Original vs Inlined counters and padded class(I3)


Same single core behaviour as above, but the cross core behaviour looks stable. Sadly the cross core results are worse than the original in many cases. Good thing there is one more trick up our sleeves.

Padding the data

So, this last step may seem a bit over the top, padding the sides of the buffer as well as spacing out all the bloody fields. Surely you must be joking? I padded the buffer by allocating an extra 32 slots to the array and skipping the first 16 on access. The object layout remains the same, you'll have to imagine the extra padding (code is here). But the results are:
Original vs Inlined counters, padded class and padded data(I4)


Run fat Q! Run! This is a nice kick in the single core category (10% increase) and in the cross core it is pretty flat indeed. So when the original behaves they are much the same, but the average result is a 10% improvement. Very little variance remains.

But it's UGLY!

Got me there, indeed it is. Why do we need to go this far to get rid of flaky behaviour? This is one of
Y do we care?
them times when having a direct line to memory management really helps, and Java has always been about you not having to worry about this sort of thing. There is help on the way in the form of the @Contended annotation which would render the above class much nicer, but even then you will need to pad the buffer by yourself. If you look at how the OffHeapQueue manages this issue, you'll see that less padding is required when you know the data alignment. Sadly there is no @Align(size=byte|short|int|long|cache|page) annotation coming anytime soon, and the @Contended annotation is not clear on how you could go about marking an instance rather than a class/field as contended.

Hang on, what if we do it the other way around?

For all you logical minded people out there who think: "we applied the changes together, but not separately. What if this is all about the data padding? we could fix the original without all this fuss...". I feel your pain obsessive brothers. So I went back and added 2 variants, one of a padded data original(referred to as O2 in the graphs) and another of the padded data and inlined fields without the class field padding. I was surprised by the results:

Same Core comparison of data padding impact


Padding the data, of and by itself made things worse for the original implementation and the inlined implementation when running on the same core. Padding the data sucks!
Cross Core comparison of data padding impact
When we look at the cross core results we can see some benefit from the data padding, suggesting it is part of the issue we are trying to solve, but not the whole story.
Padding the cold fields by itself also did little for the performance, as demonstrated above, but removed some of the variance. The 2 put together however gave us some extra performance, and killed the variance issue. Why? I don't know... But there you go, holistic effects of memory layout at work.

A happy ending?

Well... sort of. You see, all the above tests were run with the following JVM options:
-XX:+UseNUMA -XX:+UseCondCardMark -XX:CompileThreshold=100000
And thus the results present a certain truth, but maybe not all of the truth for everyone all of the time... I decided to waste some electricity and rerun some variations of the above options to see what happens. Running with no JVM options I got the following:
Cross core - no JVM opts
  • I4 which is the final inlined version is still quite stable, but it's performance lags behind the other implementations.
  • O1 which is the original implementation has less variance then before (could be luck, who knows) and has the best average result.

Same core - no JVM opts

  • This time I3 (inlined counters, padded class) is the clear winner.
  • I2 (inlined counters, padded data) is second, followed by I4 and I1.
  • When running on the same core, inlining the counters delivers a minor performance win.
Running with -XX:+UseCondCardMark:
ConcCardMark cross core
  • Using CondCardMark has reduced the range somewhat, but still a fair range for all.
  • I4 and I3 are stable, but the overall best average score goes to I1 (133M).
  • I4 is overall slightly better then O1 but worse then O2(original implementation with padded data)
ConcCardMark same core
  • I1 is the best by quite a bit, followed by I4.
Running with -XX:+UseCondCardMark -XX:CompileThreshold=100000:
-XX:+UseCondCardMark -XX:CompileThreshold=100000 - Cross Core
  • With the increased compile threshold O2 has pulled to the top followed by I4 and O1.
  • I4 is looking very much like O1
  • I1, which was the clear winner a second ago, is now rubbish.

-XX:+UseCondCardMark -XX:CompileThreshold=100000 - Same Core
  • On the same core we are seeing now the same behaviour we saw in the beginning.
  • I4 is the clear winner, I1 and I3 pushed down to second place etc.
  • This is odd, why would giving the JIT more data for compilation push the performance of I1 down?
And for comparison the results from -XX:+UseNUMA -XX:+UseCondCardMark -XX:CompileThreshold=100000 presented together:
All together now - Cross Core
  • I4 is the most stable, although O2 has the best overall average throughput.
  • Comparing with all the other combinations we seem to have degraded the performance of other options on our way to find the best option for I4 :(
All together now - Same Core

What does it all mean?

At this point I got quite frustrated. The above approach was producing improvements to the variance, and even an improvement to overall performance on occasion, but the effect was not as decisive and clear as I'd have liked. Some of the variation I just could not shake, even with the best result above I4 is still moving between 110M and 120M and has a spike on either side of this range.
These results are a fine demonstration of how time consuming and error prone the estimation of performance can be. To collect the above results I had to setup the same test to run for each implementation 30 times for each of the affinity options (same core/cross core/cross socket) and repeat
for 4 JVM options combinations (8 impls * 30 runs * 3 affinity setups * 4 JVM_OPTS + extra trial and error runs... = lots of time, quite a bit of data). This result is likely to be different on different hardware, it is likely to change with JDK8, other JVM options and so on. Even with all this effort of collecting data I am very far from having a complete answer. Is this data enough to produce meaningful results/insights at all?
To a certain extent there is evident progress here towards eliminating some of the sources of the run to run variation. Looking at the above results I feel justified in my interpretation of the object layout and the test results. In running the same code on other hardware I've observed good results for I4 and similar variation for the O1, so not all is lost. But this journey is, surprisingly enough, still not over...

More, more, more!

David Hume
If you found this riveting tale of minute implementation changes and their wacky effect on performance absorbing you will love the next chapter in which:
  • I explore cross socket performance and it's implications
  • We contemplate the ownership of memory and it's impact
  • The original implementation is evolved further
  • We hit a high note with 250M ops per second
This post is long enough as is :-)

7 comments:

  1. Cool work. Heisenberg would be proud.

    Would be interesting to add double line padding to your experiments, BTW, to see if L2 streaming prefetch (aka adjacent line prefetch or adjacent sector prefetch, depending on core type and BIOS/HW vendor) is in play. Also check/play with BIOS settings that turn this on and off, to add to the fun.

    ReplyDelete
    Replies
    1. Thanks :) Mr. Heisenberg has allot to answer for.
      As I mention, this is far from complete analysis. More of a shared tale of woe. I have considered double padding, in fact I believe @Contended went down that path just to be on the safe side. Might give that a go at some point.

      Delete
  2. Have you considering isolating a CPU set and then binding the JVM to the isolated CPU set? Also, I've found speculative execution can be a cause of variance and this is typically well addressed in my C/C++ coding with the PAUSE instruction in busy spin loops.

    ReplyDelete
    Replies
    1. I've considered it, but given this is a throughput comparison and the effects are uniform I'm not sure how much relative difference it would make. I can see how it would improve overall performance. The machine I use is very quiet and fairly well tuned.
      If I have time I'd like to repeat the experiment in C/C++ and see if I'm falsely accusing the JVM for some of the variance, in which case I will make a mental note to replace the Thread.yield() with a PAUSE. If the cause for variance is indeed speculative execution, how would you suggest I eliminate it in Java?

      Delete
    2. You might be surprised how a machine is not as quiet as you think.

      http://vanillajava.blogspot.co.uk/2013/07/micro-jitter-busy-waiting-and-binding.html

      You can see how Doug plays with some bit twiddling in the Exchanger as a means of trying to avoid speculative execution.

      http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/main/java/util/concurrent/Exchanger.java?view=markup

      Delete
  3. Sir can u give me idea why should we use this queue implementation instead of LinkedList .which is also do the same.
    also if i can use this queue.by which approach i can use this queue in multi-threading [multiple producer threads & multiple consumer threads]. first approach :: synchronized block with queue object and notify it. second approach:: reentrantlock.

    ReplyDelete
    Replies
    1. I'm no Sir :)
      The above queue is:
      1. Concurrently accessed, so LinkedList would not work
      2. Bounded, again unlike LinkedList

      Now, the merits of bounded vs. unbounded are out of scope, but lets focus on the concurrency aspect. You propose we fix this issue using a lock (in whichever implementation). The above implementation is lock free (actually wait free) so provides very different guarantees to a lock based approach. The above described approach also vastly outperforms the lock based implementations.
      So to summarize motivation:
      1. Lock/wait free guarantees
      2. Performance

      Delete

Note: only a member of this blog may post a comment.