Psychosomatic, Lobotomy, Saw: SPSC Revisited - part II: Floating vs Inlined Counters

Tuesday, 23 July 2013

SPSC Revisited - part II: Floating vs Inlined Counters

{This post is part of a long-running series on lock-free queues; check out the full index to get more context here}
Continued from here, we examine the comparative performance of 2 approaches to implementing queue counters.
If we trace back through the journey we've made with our humble SPSC queue you'll note a structural back and forth has happened. In the first post, the first version had the following layout (produced using the excellent java-object-layout tool):

uk.co.real_logic.queues.P1C1QueueOriginal1
 offset  size  type      description
      0    12            (assumed to be the object header + first field alignment)
     12     4  Object[]  P1C1QueueOriginal1.buffer
     16     8  long      P1C1QueueOriginal1.tail
     24     8  long      P1C1QueueOriginal1.head
     32                  (object boundary, size estimate)

We went on to explore various optimisations which were driven by our desire to:

Replace volatile writes with lazySet
Reduce false sharing of hot fields

To achieve these we ended up extracting the head/tail counters into their own objects and the layout turned to this:

uk.co.real_logic.queues.P1C1QueueOriginal3
 offset  size  type        description
      0    12              (assumed to be the object header + first field alignment)
     12     4  int         P1C1QueueOriginal3.capacity
     16     4  int         P1C1QueueOriginal3.mask
     20     4  Object[]    P1C1QueueOriginal3.buffer
     24     4  AtomicLong  P1C1QueueOriginal3.tail
     28     4  AtomicLong  P1C1QueueOriginal3.head
     32     4  PaddedLong  P1C1QueueOriginal3.tailCache
     36     4  PaddedLong  P1C1QueueOriginal3.headCache
     40                    (object boundary, size estimate)

But that is not the whole picture, we now had 4 new objects (2 of each class below) referred from the above object:

uk.co.real_logic.queues.P1C1QueueOriginal3$PaddedLong
 offset  size  type  description
      0    12        (assumed to be the object header + first field alignment)
     12     4        (alignment/padding gap)
     16     8  long  PaddedLong.value
  24-64     8  long  PaddedLong.p1-p6
     72              (object boundary, size estimate)

uk.co.real_logic.queues.PaddedAtomicLong
 offset  size  type  description
      0    12        (assumed to be the object header + first field alignment)
     12     4        (alignment/padding gap)
     16     8  long  AtomicLong.value
  24-64     8  long  PaddedAtomicLong.p1-p6
     72              (object boundary, size estimate)
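
For illustration, one-sided padding of this kind can be sketched roughly as below. The class names and the p1-p6 field names are taken from the layouts above; the class bodies and the OneSidedPaddingDemo helper are assumptions made for the example, not the original code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: the value comes first, followed by 6 longs (48 bytes) of
// padding, so only the memory after the value is protected.
class PaddedLong {
    long value;
    long p1, p2, p3, p4, p5, p6;
}

// HotSpot typically places subclass fields after the superclass fields,
// so p1..p6 end up behind AtomicLong.value.
class PaddedAtomicLong extends AtomicLong {
    long p1, p2, p3, p4, p5, p6;
}

// Hypothetical helper showing the lazySet publication the series relies on.
final class OneSidedPaddingDemo {
    static long publish(PaddedAtomicLong tail, long next) {
        // lazySet is an ordered store, cheaper than a full volatile write
        tail.lazySet(next);
        return tail.get();
    }
}
```

Nothing here protects the memory in front of the object header, which is why the padding only counts on one side.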

These counters differ from the original in that they are padded (on one side at least), but they also represent a different overall memory layout/access pattern. The counters are now floating: they can get relocated at the whim of the JVM. Given these are in all probability long-lived objects, I wouldn't think they move much after a few collections, but each collection presents a new opportunity to shuffle them about.
In my last post I explored 2 directions at once:

I fixed all the remaining false-sharing potential by padding the counters, the class itself and the data in the array.
I inlined the counters back into the queue class, in a way returning to the original layout (with all the trimmings we added later), but with more padding.

I ended up with the following layout:

psy.lob.saw.queues.spsc4.SPSCQueue4
  offset  size  type      description
       0    12            (assumed to be the object header + first field alignment)
      12     4            (alignment/padding gap)
   16-72     8  long      L0Pad.p00-07
      80     4  int       ColdFields.capacity
      84     4  int       ColdFields.mask
      88     4  Object[]  ColdFields.buffer
      92     4            (alignment/padding gap)
  96-144     8  long      L1Pad.p10-16
     152     8  long      TailField.tail
 160-208     8  long      L2Pad.p20-26
     216     8  long      HeadCache.headCache
 224-272     8  long      L3Pad.p30-36
     280     8  long      HeadField.head
 288-336     8  long      L4Pad.p40-46
     344     8  long      TailCache.tailCache
 352-400     8  long      L5Pad.p50-56
     408                  (object boundary, size estimate)
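
A layout like this is usually obtained by spreading the fields over a class hierarchy, one group per class. The sketch below reuses the class and field names from the dump above; the constructors, the power-of-two check and the InlinedCountersSketch name are assumptions added for illustration and are not taken from the original source.

```java
// Sketch only: HotSpot typically keeps superclass fields ahead of subclass
// fields, so the inheritance chain pins the order of padding and counters.
abstract class L0Pad { long p00, p01, p02, p03, p04, p05, p06, p07; }

abstract class ColdFields extends L0Pad {
    final int capacity;
    final int mask;
    final Object[] buffer;

    ColdFields(int capacity) {
        // Assumption: capacity is a power of 2 so that (capacity - 1) is a mask
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of 2");
        }
        this.capacity = capacity;
        this.mask = capacity - 1;
        this.buffer = new Object[capacity];
    }
}

abstract class L1Pad extends ColdFields {
    long p10, p11, p12, p13, p14, p15, p16;
    L1Pad(int capacity) { super(capacity); }
}
abstract class TailField extends L1Pad {
    volatile long tail;      // written by the producer
    TailField(int capacity) { super(capacity); }
}
abstract class L2Pad extends TailField {
    long p20, p21, p22, p23, p24, p25, p26;
    L2Pad(int capacity) { super(capacity); }
}
abstract class HeadCache extends L2Pad {
    long headCache;          // producer's cached view of head
    HeadCache(int capacity) { super(capacity); }
}
abstract class L3Pad extends HeadCache {
    long p30, p31, p32, p33, p34, p35, p36;
    L3Pad(int capacity) { super(capacity); }
}
abstract class HeadField extends L3Pad {
    volatile long head;      // written by the consumer
    HeadField(int capacity) { super(capacity); }
}
abstract class L4Pad extends HeadField {
    long p40, p41, p42, p43, p44, p45, p46;
    L4Pad(int capacity) { super(capacity); }
}
abstract class TailCache extends L4Pad {
    long tailCache;          // consumer's cached view of tail
    TailCache(int capacity) { super(capacity); }
}
abstract class L5Pad extends TailCache {
    long p50, p51, p52, p53, p54, p55, p56;
    L5Pad(int capacity) { super(capacity); }
}

// Hypothetical concrete class; the queue methods are omitted.
final class InlinedCountersSketch extends L5Pad {
    InlinedCountersSketch(int capacity) { super(capacity); }
}
```

The actual offsets still depend on the JVM, so a tool such as java-object-layout remains the way to confirm what you got.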

This worked well, reducing run-to-run variance almost completely and delivering good performance. The problem was that it failed to hit the highs of the original implementation and its variants, particularly when running across cores.
To further explore the difference between the inlined and floating versions I went back and applied the full padding treatment to the floating counters version. This meant replacing PaddedLong and PaddedAtomicLong with fully padded implementations, adding padding around the class fields and padding the data. The full code is here; it's very similar to what we've done to pad the other classes. The end result has the following layout:

psy.lob.saw.queues.spsc.fc.SPSPQueueFloatingCounters4
  offset  size  type              description
       0    12                    (assumed to be the object header + first field alignment)
      12     4                    (alignment/padding gap)
   16-72     8  long              SPSPQueueFloatingCounters4P0.p00-p07
      80     4  int               SPSPQueueFloatingCounters4Fields.capacity
      84     4  int               SPSPQueueFloatingCounters4Fields.mask
      88     4  Object[]          SPSPQueueFloatingCounters4Fields.buffer
      92     4  VolatileLongCell  SPSPQueueFloatingCounters4Fields.tail
      96     4  VolatileLongCell  SPSPQueueFloatingCounters4Fields.head
     100     4  LongCell          SPSPQueueFloatingCounters4Fields.tailCache
     104     4  LongCell          SPSPQueueFloatingCounters4Fields.headCache
     108     4                    (alignment/padding gap)
 112-168     8  long              SPSPQueueFloatingCounters4.p10-p17
     176                          (object boundary, size estimate)

psy.lob.saw.queues.spsc.fc.LongCell
  offset  size  type  description
       0    12        (assumed to be the object header + first field alignment)
      12     4        (alignment/padding gap)
   16-64     8  long  LongCellP0.p0-p6
      72     8  long  LongCellValue.value
  80-128     8  long  LongCell.p10-p16
     136              (object boundary, size estimate)

psy.lob.saw.queues.spsc.fc.VolatileLongCell
  offset  size  type  description
       0    12        (assumed to be the object header + first field alignment)
      12     4        (alignment/padding gap)
   16-64     8  long  VolatileLongCellP0.p0-p6
      72     8  long  VolatileLongCellValue.value
  80-128     8  long  VolatileLongCell.p10-p16
     136              (object boundary, size estimate)

If we cared more about memory consumption we could count the object header as padding and pad with integers to avoid the alignment gaps. I'm not that worried, so I won't. Note that the floating counters have to consume double the padding required for the flattened counters as they have no guarantees about their neighbours on either side. In the interest of comparing the impact of the data padding separately I also implemented a version without data padding.
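
As a rough sketch of what a counter padded on both sides can look like, here is a cell following the VolatileLongCell layout above. The class and field names match the dump; the get/set/lazySet methods and the use of AtomicLongFieldUpdater for the ordered write are assumptions of this example, and the linked code may well do it differently.

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

// Padding in front of the value: 7 longs, 56 bytes.
abstract class VolatileLongCellP0 { long p0, p1, p2, p3, p4, p5, p6; }

abstract class VolatileLongCellValue extends VolatileLongCellP0 {
    // Assumption: a field updater provides the lazySet (ordered) write.
    static final AtomicLongFieldUpdater<VolatileLongCellValue> VALUE =
            AtomicLongFieldUpdater.newUpdater(VolatileLongCellValue.class, "value");
    volatile long value;
}

// Padding behind the value: another 7 longs, so neither neighbour in the
// heap can share a cache line with it, wherever the GC moves the object.
final class VolatileLongCell extends VolatileLongCellValue {
    long p10, p11, p12, p13, p14, p15, p16;

    long get() { return value; }                      // volatile read
    void set(long v) { value = v; }                   // volatile write
    void lazySet(long v) { VALUE.lazySet(this, v); }  // ordered write
}
```

Going by the size estimates above, the floating version adds up to 176 + 4 × 136 = 720 bytes for the queue and its four counters, against 408 bytes for the inlined layout.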

Which one is more better?

While the charts produced previously are instructive and good at highlighting the variance, they make the post very bulky, so this time we'll try a table. The data with the charts is available here for those who prefer them. I've expanded the testing of JVM parameters a bit to cover the effect of the combinations of the 3 options used before, otherwise the method and setup are the same. The abbreviations stand for the following:

O1 - original lock-free SPSC queue with the original padding.
FC3 - Floating Counters, fully padded, data is not padded.
FC4 - Floating Counters, fully padded, data is padded.
I3 - Inlined Counters, fully padded, data is not padded.
I4 - Inlined Counters, fully padded, data is padded.
CCM - using the -XX:+UseCondCardMark flag
CT - using the -XX:CompileThreshold=100000 flag
NUMA - using the -XX:+UseNUMA flag

Same Core (each cell is min, mid, max; all numbers are in millions of ops/sec; bold is best result, red is best overall):

	No Opts	CCM	CT	NUMA	CCM+CT	CCM+NUMA	CT+NUMA	CCM+CT+NUMA
O1	107,108,113	97,102,103	111,112,112	88,108,113	97,103,103	100,102,103	111,112,112	102,103,103
FC3	105,108,113	102,103,104	111,112,113	103,108,113	102,103,103	97,103,103	111,112,113	102,103,103
FC4	93,96,101	87,89,90	99,100,101	82,95,100	89,90,90	86,90,90	81,101,101	89,90,90
I3	103,123,130	97,98,100	129,129,130	122,129,130	105,106,107	97,98,99	128,129,130	104,105,107
I4	108,113,119	103,105,106	111,113,113	99,118,120	105,113,114	98,104,114	112,112,113	105,114,114

Cross Core:

	No Opts	CCM	CT	NUMA	CCM+CT	CCM+NUMA	CT+NUMA	CCM+CT+NUMA
O1	38,90,130	57,84,136	40,94,108	38,59,118	58,98,114	55,113,130	39,53,109	57,96,112
FC3	94,120,135	109,141,154	94,105,117	98,115,124	103,115,128	116,129,140	95,108,118	102,120,127
FC4	106,124,132	118,137,152	104,113,119	107,119,130	107,126,133	114,132,150	105,114,119	99,123,131
I3	86,114,122	88,113,128	72,96,111	90,112,123	86,98,108	85,117,125	88,96,111	90,99,108
I4	49,107,156	78,133,171	58,100,112	48,126,155	108,128,164	88,143,173	55,96,113	104,115,164

I leave it to you to read meaning into the results, but my takeaways are as follows:

Increasing the CompileThreshold is not equally good for all code in all cases. In the data above it is not proving helpful in and of itself in the cross core case for any implementation. It does seem to help once CCM is thrown in as well.
Using UseCondCardMark makes a mighty difference. It hurts performance on the single core, but greatly improves the cross core case for all implementations. The difference made by CCM is covered by Dave Dice here and it goes a long way to explain the variance experienced in the inlined versions when running without it.
NUMA makes little difference that I can see to the above cases. This is as expected since the code is being run on the same NUMA node throughout. Running across NUMA nodes we might see a difference.
As you can see there is still quite a bit of instability going on, though as an overall recommendation thus far I'd say I4 is the winner. FC4 is not that far behind when you consider the mid results to be the most representative. It also offers more stable overall results in terms of the variance.
173M ops/sec! It's a high note worth musing over... But... didn't I promise you 250M? I did, you'll have to wait.
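
The stability remark about FC4 and I4 can be checked directly against the cross core table. The snippet below is only a back-of-the-envelope aid: the numbers are copied from the FC4 and I4 rows, while the SpreadCheck class and the choice of max minus min as the measure of spread are assumptions of this example.

```java
final class SpreadCheck {
    // Cross core rows from the table: {min, mid, max} per JVM option set,
    // in millions of ops/sec, in the same column order as the table.
    static final int[][] FC4 = {
        {106, 124, 132}, {118, 137, 152}, {104, 113, 119}, {107, 119, 130},
        {107, 126, 133}, {114, 132, 150}, {105, 114, 119}, {99, 123, 131}
    };
    static final int[][] I4 = {
        {49, 107, 156}, {78, 133, 171}, {58, 100, 112}, {48, 126, 155},
        {108, 128, 164}, {88, 143, 173}, {55, 96, 113}, {104, 115, 164}
    };

    // Average of (max - min) across all option sets for one implementation.
    static double meanSpread(int[][] row) {
        long sum = 0;
        for (int[] cell : row) {
            sum += cell[2] - cell[0];
        }
        return (double) sum / row.length;
    }

    public static void main(String[] args) {
        System.out.println("FC4 mean spread: " + meanSpread(FC4));
        System.out.println("I4 mean spread:  " + meanSpread(I4));
    }
}
```

On these numbers the average min-to-max spread is 25.75M ops/sec for FC4 and 77.5M ops/sec for I4, which matches the impression that I4 reaches higher peaks while FC4 is the steadier of the two.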

The above results also demonstrate that data inlining is a valid optimization with measurable benefits. I expect the results in more real-life scenarios to favor the inlined version even more as it offers better data locality and more predictable access than the floating fields variants.
One potential advantage for the floating counters may be available should we be able to allocate the counters on their writer threads. I have not explored this option, but based on Dave Dice's observations I expect some improvement. This will make the queue quite awkward to set up, but worth a go.
There's at least one more post coming on this topic, considering the mechanics of the experiment and their implications. And after that? buggered if I know. But hang in there, it might become interesting again ;-)

UPDATE: thanks Chris Vest for pointing out I left out the link to the data, fixed now.

16 comments:

Unknown, 23 Jul 2013, 13:27:00
looking forward to the next post =)
Anonymous, 1 Aug 2013, 10:07:00
Hi Nitsan,

have you checked out Fast-Flow?
http://calvados.di.unipi.it/fastflow

I'm in the process of creating a java port (for x86 / AMD64 only at the moment) of fast flow.

Their core structure is a bounded FIFO SPSC queue dubbed a Buffer.

When correctly cache aligned (data and control variables) I manage over 400M queue/dequeue (synthetic) operations within a second on an i7 3720 2.6 GHz (Sun JDK 1.7, default settings).
This gets reduced to about 150M when using actual data due to memory pressure.

Anonymous, 13 Sept 2013, 19:28:00
Hi Nitsan,

Good article regarding the meaning of No-Op.
I checked out your code and test harness.
When I remove the Thread.yield() calls from the Producer and Consumer in QueuePerfTest (to simulate busy spinning), I end up with occasional JVM crashes in FFBuffer, FFBufferOrdered and SPSCQueue5.
After looking at the assemblies it appears that the C2 compiler is optimizing and rearranging instructions as it pleases regardless of using Unsafe.putOrdered* or volatile keywords.
Not that the ASM it generates is wrong, it just appears to end up in an endless loop.
Would you know a means to disallow such optimizations?

Regards,
Pressenna