{This post is part of a long running series on lock free queues, check out the full index to get more context here}
Applying lessons learnt about memory layout to the previously discussed SPSC (Single Producer Single Consumer) queue. Tackling and observing some manifestations of false sharing. Getting bitch slapped by reality.
In preparation for implementing an MPMC (Multi Producer Multi Consumer) queue I went back to Martin Thompson's SPSC queue, which I dissected in detail in this blog post. I was going to use it as the basis for the new queue, make a few changes, and discuss the transition; in particular the opportunity offered by the new getAndAdd intrinsic added in JDK 8. I was going to... but then I thought, 'let me try a couple more things!' :-).
Where were we?
In case you can't be bothered to go back and read the whole thing again, here's a quick summary. Mr. Thompson open sourced a few samples he discusses in his presentation on lock-free algorithms, in particular a SPSC queue developed across a few stages. I broke down the stages further and benchmarked before and after to explore the effect of each optimisation as it is applied. I then ported the same queue into an off-heap implementation and used it to demonstrate a fast IPC mechanism capable of sending 135M messages per second between processes. The queue demonstrated the following techniques:
Snowman contemplating evolution
- lock free, single writer principle observed.
- Set capacity to power of 2, use mask instead of modulo.
- Use lazySet instead of volatile write.
- Minimize volatile reads by adding local cache fields.
- Pad all the hot fields: head, tail, headCache, tailCache to avoid false sharing.
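The techniques listed above can be sketched as a single class. This is a minimal sketch, not Mr. Thompson's actual code: the hot fields are left unpadded here for brevity and the names are assumed.

```java
import java.util.concurrent.atomic.AtomicLong;

// A minimal sketch of the listed techniques (not the original code):
// unpadded fields, assumed names, SPSC use only.
public final class SpscSketch<E> {
    private final E[] buffer;
    private final int mask;                            // capacity is a power of 2
    private final AtomicLong tail = new AtomicLong();  // written only by the producer
    private final AtomicLong head = new AtomicLong();  // written only by the consumer
    private long headCache;  // producer's possibly stale copy of head
    private long tailCache;  // consumer's possibly stale copy of tail

    @SuppressWarnings("unchecked")
    public SpscSketch(int capacity) {
        int actual = 1;
        while (actual < capacity) actual <<= 1;        // round capacity up to a power of 2
        buffer = (E[]) new Object[actual];
        mask = actual - 1;
    }

    public boolean offer(E e) {
        long t = tail.get();
        if (t - headCache >= buffer.length) {          // looks full: refresh the cached head
            headCache = head.get();                    // the only volatile read on this path
            if (t - headCache >= buffer.length) return false;
        }
        buffer[(int) t & mask] = e;                    // mask instead of modulo
        tail.lazySet(t + 1);                           // lazySet instead of a volatile write
        return true;
    }

    public E poll() {
        long h = head.get();
        if (h >= tailCache) {                          // looks empty: refresh the cached tail
            tailCache = tail.get();
            if (h >= tailCache) return null;
        }
        int i = (int) h & mask;
        E e = buffer[i];
        buffer[i] = null;                              // let the element be collected
        head.lazySet(h + 1);
        return e;
    }
}
```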
To expose the variance I modified the test to produce a summary line (each test runs 20 iterations, the summary is the average of the last 10 test iterations) and ran it 30 times. The results demonstrate the variance (results are sorted to highlight the range, X-axis is the run index, Y-axis is a summary of the run, SC means producer and consumer run on the same core, CC means they run across cores on the same socket):
Original queue performance
So I thought, let's poke the i's and kick the t's, see if we can shake the variance out of the bugger.
Terms & Conditions
Benchmarks were carried out on a dual socket Xeon running CentOS 6 and Oracle JDK 1.7u17. Affinity was set using taskset, the scripts used are included with the code as well as the raw results and the LibreOffice spreadsheets used for the graphs. Furry, fluffy animals were not harmed and people of different religions and races were treated with the utmost respect and afforded equal opportunity. I ran the cross socket tests, and the data is included, but I chose not to discuss them as no improvement was made on that front and they would make this already long post longer.
Flatten me counters!
To start off, I was not thrilled with the way the counter padding was implemented, for 2 reasons:
- By using container classes for the counters we introduce indirection, we could optimise by inlining the data into the queue structure.
- The Padded* classes only pad to one side, we are counting on the instances to be laid out together because they are instantiated together. In the right environment I'm pretty sure this can go wrong. By go wrong I mean the instances might get allocated/placed next to data modified elsewhere, leading to false sharing. By inlining the counters and forcing strict order we could kill two birds with one stone.
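The counters can be inlined by layering the queue's fields across a chain of superclasses, relying on the JVM laying out superclass fields before subclass fields. This is a sketch of the idea only: the class and field names are assumed to match the layout dump, and accessors/volatile plumbing are simplified.

```java
// Inlined counters via a chain of superclasses; each pad layer of seven longs
// separates the hot fields. Names assumed; a sketch, not the original code.
class ColdFields<E> {
    protected final int capacity, mask;
    protected final E[] buffer;
    @SuppressWarnings("unchecked")
    ColdFields(int capacity) {
        this.capacity = capacity;
        this.mask = capacity - 1;
        this.buffer = (E[]) new Object[capacity];
    }
}
class L1Pad<E> extends ColdFields<E> { long p10, p11, p12, p13, p14, p15, p16; L1Pad(int c) { super(c); } }
class TailField<E> extends L1Pad<E> { protected volatile long tail; TailField(int c) { super(c); } }
class L2Pad<E> extends TailField<E> { long p20, p21, p22, p23, p24, p25, p26; L2Pad(int c) { super(c); } }
class HeadCache<E> extends L2Pad<E> { protected long headCache; HeadCache(int c) { super(c); } }
class L3Pad<E> extends HeadCache<E> { long p30, p31, p32, p33, p34, p35, p36; L3Pad(int c) { super(c); } }
class HeadField<E> extends L3Pad<E> { protected volatile long head; HeadField(int c) { super(c); } }
class L4Pad<E> extends HeadField<E> { long p40, p41, p42, p43, p44, p45, p46; L4Pad(int c) { super(c); } }
class TailCache<E> extends L4Pad<E> { protected long tailCache; TailCache(int c) { super(c); } }
class L5Pad<E> extends TailCache<E> { long p50, p51, p52, p53, p54, p55, p56; L5Pad(int c) { super(c); } }
public class SPSCQueue1<E> extends L5Pad<E> {
    public SPSCQueue1(int capacity) { super(capacity); }
    // offer/poll omitted; same logic as before, now against the inlined fields
}
```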
It ain't pretty, but it does the job. The layout can be verified using the excellent java-object-layout tool (I format the output for brevity):
psy.lob.saw.queues.spsc1.SPSCQueue1
offset  size  type      description
0       12              (assumed to be the object header + first field alignment)
12      4     int       ColdFields.capacity
16      4     int       ColdFields.mask
20      4     Object[]  ColdFields.buffer
24-72   8     long      L1Pad.p10-p16
80      8     long      TailField.tail
88-136  8     long      L2Pad.p20-p26
144     8     long      HeadCache.headCache
152-200 8     long      L3Pad.p30-p36
208     8     long      HeadField.head
216-264 8     long      L4Pad.p40-p46
272     8     long      TailCache.tailCache
280-328 8     long      L5Pad.p50-p56
336                     (object boundary, size estimate)
Wicked! Now run the same tests as above to see what happened:
We get a small improvement when running on the same core (average 3% improvement), but the cross core behaviour is actually worse. Bummer, keep kicking.
Padding the class fields
If we look at the above memory layout, we'll notice the fields capacity, mask and buffer are left flush against the object header. This means that they are open to false sharing with other objects/data allocated on the same cache line. We can add a further love handle on that big boy to cover that front:
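That extra love handle is one more padding layer prepended at the root of the class chain. A sketch, with names assumed to match the layout dump:

```java
// Prepending a padding layer so the cold fields are no longer flush against
// the object header. Names assumed; a sketch of the idea only.
class L0Pad { long p00, p01, p02, p03, p04, p05, p06, p07; }

class ColdFields<E> extends L0Pad {
    protected final int capacity, mask;
    protected final E[] buffer;
    @SuppressWarnings("unchecked")
    ColdFields(int capacity) {
        this.capacity = capacity;
        this.mask = capacity - 1;
        this.buffer = (E[]) new Object[capacity];
    }
}
// ... the L1Pad/TailField/HeadCache/HeadField/TailCache chain extends
// ColdFields exactly as before.
```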
Note that by happy coincidence we have already padded the tail side of the fields as part of our flattening exercise. So now the layout is:
psy.lob.saw.queues.spsc3.SPSCQueue3
offset  size  type      description
0       12              (assumed to be the object header + first field alignment)
12      4               (alignment/padding gap)
16-72   8     long      L0Pad.p00-p07
80      4     int       ColdFields.capacity
84      4     int       ColdFields.mask
88      4     Object[]  ColdFields.buffer
92      4               (alignment/padding gap)
96-144  8     long      L1Pad.p10-p16
152     8     long      TailField.tail
160-208 8     long      L2Pad.p20-p26
216     8     long      HeadCache.headCache
224-272 8     long      L3Pad.p30-p36
280     8     long      HeadField.head
288-336 8     long      L4Pad.p40-p46
344     8     long      TailCache.tailCache
352-400 8     long      L5Pad.p50-p56
408                     (object boundary, size estimate)
Try again and we get:
Same single core behaviour as above, but the cross core behaviour looks stable. Sadly the cross core results are worse than the original in many cases. Good thing there is one more trick up our sleeves.
Padding the data
So, this last step may seem a bit over the top, padding the sides of the buffer as well as spacing out all the bloody fields. Surely you must be joking? I padded the buffer by allocating an extra 32 slots to the array and skipping the first 16 on access. The object layout remains the same, you'll have to imagine the extra padding (code is here). But the results are:
Run fat Q! Run! This is a nice kick in the single core category (10% increase) and in the cross core it is pretty flat indeed. So when the original behaves they are much the same, but the average result is a 10% improvement. Very little variance remains.
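The buffer padding described above can be sketched as follows. This assumes a pad of 16 reference slots on each side (roughly two cache lines with compressed oops on a 64-byte-line machine); the real code linked in the post may differ in detail.

```java
// Padding the data: allocate 32 extra array slots and offset every access by
// 16, so the slots in use never share a cache line with a neighbouring object.
public final class PaddedBuffer<E> {
    private static final int PAD = 16;  // slots skipped on each side (assumed constant)
    private final E[] buffer;
    private final int mask;

    @SuppressWarnings("unchecked")
    public PaddedBuffer(int capacity) {  // capacity must be a power of 2
        buffer = (E[]) new Object[capacity + 2 * PAD];
        mask = capacity - 1;
    }

    private int index(long sequence) {
        return PAD + ((int) sequence & mask); // the first PAD slots are never touched
    }

    public void set(long sequence, E e) { buffer[index(sequence)] = e; }

    public E get(long sequence) { return buffer[index(sequence)]; }
}
```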
But it's UGLY!
Got me there, indeed it is. Why do we need to go this far to get rid of flaky behaviour? This is one of those times when having a direct line to memory management really helps, and Java has always been about you not having to worry about this sort of thing. There is help on the way in the form of the @Contended annotation, which would render the above class much nicer, but even then you will need to pad the buffer by yourself. If you look at how the OffHeapQueue manages this issue, you'll see that less padding is required when you know the data alignment. Sadly there is no @Align(size=byte|short|int|long|cache|page) annotation coming anytime soon, and the @Contended annotation is not clear on how you could go about marking an instance rather than a class/field as contended.
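For illustration, here is roughly how the class could look with @Contended. The annotation below is a stand-in declared locally for the sketch to keep it self-contained; the real one (sun.misc.Contended in JDK 8) is recognised by the JVM and requires -XX:-RestrictContended for classes outside the JDK.

```java
// Stand-in with the same shape as the real @Contended (value() names a
// contention group); declared here only so the sketch compiles anywhere.
@interface Contended { String value() default ""; }

// A sketch: hot fields grouped per writer, no manual pad classes needed.
final class ContendedSpscQueue<E> {
    private final E[] buffer;
    private final int mask;
    @Contended("producer") volatile long tail;   // producer-owned hot fields in one group
    @Contended("producer") long headCache;
    @Contended("consumer") volatile long head;   // consumer-owned hot fields in another
    @Contended("consumer") long tailCache;

    @SuppressWarnings("unchecked")
    ContendedSpscQueue(int capacity) {           // capacity must be a power of 2
        buffer = (E[]) new Object[capacity];
        mask = capacity - 1;
    }
    // offer/poll omitted; same logic as the padded versions above
}
```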
Y do we care?
Hang on, what if we do it the other way around?
For all you logical minded people out there who think: "we applied the changes together, but not separately. What if this is all about the data padding? we could fix the original without all this fuss...". I feel your pain, obsessive brothers. So I went back and added two variants, one of a padded data original (referred to as O2 in the graphs) and another of the padded data and inlined fields without the class field padding. I was surprised by the results:
Padding the data, in and of itself, made things worse for the original implementation and the inlined implementation when running on the same core. Padding the data sucks!
Cross Core comparison of data padding impact
Padding the cold fields by itself also did little for the performance, as demonstrated above, but removed some of the variance. The two put together however gave us some extra performance, and killed the variance issue. Why? I don't know... But there you go, holistic effects of memory layout at work.
A happy ending?
Well... sort of. You see, all the above tests were run with the following JVM options:
-XX:+UseNUMA -XX:+UseCondCardMark -XX:CompileThreshold=100000

And thus the results present a certain truth, but maybe not all of the truth for everyone all of the time... I decided to waste some electricity and rerun some variations of the above options to see what happens. Running with no JVM options I got the following:
Cross core - no JVM opts
- I4, which is the final inlined version, is still quite stable, but its performance lags behind the other implementations.
- O1, which is the original implementation, has less variance than before (could be luck, who knows) and has the best average result.
Same core - no JVM opts
- This time I3 (inlined counters, padded class) is the clear winner.
- I2 (inlined counters, padded data) is second, followed by I4 and I1.
- When running on the same core, inlining the counters delivers a minor performance win.
CondCardMark cross core
- Using CondCardMark has reduced the range somewhat, but still a fair range for all.
- I4 and I3 are stable, but the overall best average score goes to I1 (133M).
- I4 is overall slightly better than O1 but worse than O2 (original implementation with padded data)
CondCardMark same core
- I1 is the best by quite a bit, followed by I4.
-XX:+UseCondCardMark -XX:CompileThreshold=100000 - Cross Core
- With the increased compile threshold O2 has pulled to the top followed by I4 and O1.
- I4 is looking very much like O1
- I1, which was the clear winner a second ago, is now rubbish.
-XX:+UseCondCardMark -XX:CompileThreshold=100000 - Same Core
- On the same core we are now seeing the same behaviour we saw in the beginning.
- I4 is the clear winner, I1 and I3 pushed down to second place etc.
- This is odd, why would giving the JIT more data for compilation push the performance of I1 down?
All together now - Cross Core
- I4 is the most stable, although O2 has the best overall average throughput.
- Comparing with all the other combinations we seem to have degraded the performance of other options on our way to find the best option for I4 :(
All together now - Same Core
What does it all mean?
At this point I got quite frustrated. The above approach was producing improvements to the variance, and even an improvement to overall performance on occasion, but the effect was not as decisive and clear as I'd have liked. Some of the variation I just could not shake, even with the best result above I4 is still moving between 110M and 120M and has a spike on either side of this range.
These results are a fine demonstration of how time consuming and error prone the estimation of performance can be. To collect the above results I had to set up the same test to run for each implementation 30 times for each of the affinity options (same core/cross core/cross socket) and repeat
for 4 JVM options combinations (8 impls * 30 runs * 3 affinity setups * 4 JVM_OPTS + extra trial and error runs... = lots of time, quite a bit of data). This result is likely to be different on different hardware, it is likely to change with JDK8, other JVM options and so on. Even with all this effort of collecting data I am very far from having a complete answer. Is this data enough to produce meaningful results/insights at all?
To a certain extent there is evident progress here towards eliminating some of the sources of the run to run variation. Looking at the above results I feel justified in my interpretation of the object layout and the test results. In running the same code on other hardware I've observed good results for I4 and similar variation for the O1, so not all is lost. But this journey is, surprisingly enough, still not over...
More, more, more!
David Hume
- I explore cross socket performance and its implications
- We contemplate the ownership of memory and its impact
- The original implementation is evolved further
- We hit a high note with 250M ops per second
Cool work. Heisenberg would be proud.
Would be interesting to add double line padding to your experiments, BTW, to see if L2 streaming prefetch (aka adjacent line prefetch or adjacent sector prefetch, depending on core type and BIOS/HW vendor) is in play. Also check/play with BIOS settings that turn this on and off, to add to the fun.
Thanks :) Mr. Heisenberg has a lot to answer for.
As I mention, this is far from a complete analysis. More of a shared tale of woe. I have considered double padding, in fact I believe @Contended went down that path just to be on the safe side. Might give that a go at some point.
Have you considered isolating a CPU set and then binding the JVM to the isolated CPU set? Also, I've found speculative execution can be a cause of variance, and this is typically well addressed in my C/C++ coding with the PAUSE instruction in busy spin loops.
I've considered it, but given this is a throughput comparison and the effects are uniform I'm not sure how much relative difference it would make. I can see how it would improve overall performance. The machine I use is very quiet and fairly well tuned.
If I have time I'd like to repeat the experiment in C/C++ and see if I'm falsely accusing the JVM of some of the variance, in which case I will make a mental note to replace the Thread.yield() with a PAUSE. If the cause for variance is indeed speculative execution, how would you suggest I eliminate it in Java?
You might be surprised how a machine is not as quiet as you think.
http://vanillajava.blogspot.co.uk/2013/07/micro-jitter-busy-waiting-and-binding.html
You can see how Doug plays with some bit twiddling in the Exchanger as a means of trying to avoid speculative execution.
http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/main/java/util/concurrent/Exchanger.java?view=markup
Sir, can you give me an idea of why we should use this queue implementation instead of LinkedList, which also does the same?
Also, if I can use this queue, by which approach can I use it in multi-threading [multiple producer threads & multiple consumer threads]? First approach: synchronized block with the queue object and notify it. Second approach: ReentrantLock.
I'm no Sir :)
The above queue is:
1. Concurrently accessed, so LinkedList would not work
2. Bounded, again unlike LinkedList
Now, the merits of bounded vs. unbounded are out of scope, but let's focus on the concurrency aspect. You propose we fix this issue using a lock (in whichever implementation). The above implementation is lock free (actually wait free) so provides very different guarantees to a lock based approach. The above described approach also vastly outperforms the lock based implementations.
So to summarize motivation:
1. Lock/wait free guarantees
2. Performance
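For contrast, a sketch of the kind of lock-based wrapper the question describes (names are mine, not from the post): correct and bounded, but every offer and poll serialises on a single lock, unlike the wait-free queue discussed above.

```java
import java.util.LinkedList;
import java.util.Queue;

// A hypothetical lock-based bounded queue built on LinkedList, for contrast.
// All producers and consumers contend on the same monitor.
public final class LockedQueue<E> {
    private final Queue<E> q = new LinkedList<>();
    private final int capacity;

    public LockedQueue(int capacity) { this.capacity = capacity; }

    public synchronized boolean offer(E e) {
        if (q.size() == capacity) return false;  // bounded, unlike a raw LinkedList
        return q.offer(e);
    }

    public synchronized E poll() { return q.poll(); }
}
```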