Tuesday, 23 July 2013

SPSC Revisited - part II: Floating vs Inlined Counters

Continued from here, we examine the comparative performance of 2 approaches to implementing queue counters.
If we trace back through the journey we've made with our humble SPSC queue you'll note a structural back and forth has happened. In the first post, the first version had the following layout (produced using the excellent java-object-layout tool):
uk.co.real_logic.queues.P1C1QueueOriginal1
 offset  size     type description
      0    12          (assumed to be the object header + first field alignment)
     12     4 Object[] P1C1QueueOriginal1.buffer
     16     8     long P1C1QueueOriginal1.tail
     24     8     long P1C1QueueOriginal1.head
     32                (object boundary, size estimate)
We went on to explore various optimisations which were driven by our desire to:
  1. Replace volatile writes with lazySet
  2. Reduce false sharing of hot fields
To achieve these we ended up extracting the head/tail counters into their own objects and the layout turned to this:
uk.co.real_logic.queues.P1C1QueueOriginal3
 offset  size       type description
      0    12            (assumed to be the object header + first field alignment)
     12     4        int P1C1QueueOriginal3.capacity
     16     4        int P1C1QueueOriginal3.mask
     20     4   Object[] P1C1QueueOriginal3.buffer
     24     4 AtomicLong P1C1QueueOriginal3.tail
     28     4 AtomicLong P1C1QueueOriginal3.head
     32     4 PaddedLong P1C1QueueOriginal3.tailCache
     36     4 PaddedLong P1C1QueueOriginal3.headCache
     40                  (object boundary, size estimate)
But that is not the whole picture, we now had 4 new objects (2 of each class below) referred from the above object:
uk.co.real_logic.queues.P1C1QueueOriginal3$PaddedLong
 offset  size type description
      0    12      (assumed to be the object header + first field alignment)
     12     4      (alignment/padding gap)
     16     8 long PaddedLong.value
     24-64  8 long PaddedLong.p1-p6
     72            (object boundary, size estimate)
 
uk.co.real_logic.queues.PaddedAtomicLong
 offset  size type description
      0    12      (assumed to be the object header + first field alignment)
     12     4      (alignment/padding gap)
     16     8 long AtomicLong.value
     24-64  8 long PaddedAtomicLong.p1-p6
     72            (object boundary, size estimate)
These counters are different from the original because they are padded (on one side at least), but also they represent a different overall memory layout/access pattern. The counters are now floating. They can get relocated at the whim of the JVM. Given these are in all probability long lived objects I
wouldn't think they move much after a few collections, but each collection presents a new opportunity to shuffle them about.
In my last post I explored 2 directions at once:
  1. I fixed all the remaining false-sharing potential by padding the counters, the class itself and the data in the array.
  2. I inlined the counters back into the queue class, in a way returning to the original layout (with all the trimmings we added later), but with more padding.
I ended up with the following layout:
psy.lob.saw.queues.spsc4.SPSCQueue4
 offset  size     type description
      0    12          (assumed to be the object header + first field alignment)
     12     4          (alignment/padding gap)
     16-72  8     long L0Pad.p00-07
     80     4      int ColdFields.capacity
     84     4      int ColdFields.mask
     88     4 Object[] ColdFields.buffer
     92     4          (alignment/padding gap)
     96-144 8     long L1Pad.p10-16
    152     8     long TailField.tail
    160-208 8     long L2Pad.p20-26
    216     8     long HeadCache.headCache
    224-272 8     long L3Pad.p30-36
    280     8     long HeadField.head
    288-336 8     long L4Pad.p40-46
    344     8     long TailCache.tailCache
    352-400 8     long L5Pad.p50-56
    408                (object boundary, size estimate)
This worked well, reducing run to run variance almost completely and delivering good performance. The problem was it was failing to hit the highs of the original implementation and variants, particularly when running across cores.
To further explore the difference between the inlined and floating versions I went back and applied the full padding treatment to the floating counters version. This meant replacing PaddedLong and PaddedAtomicLong with fully padded implementations, adding padding around the class fields and padding the data. The full code is here, it's very similar to what we've done to pad the other classes. The end result has the following layout:
psy.lob.saw.queues.spsc.fc.SPSPQueueFloatingCounters4
 offset  size             type description
      0    12                  (assumed to be the object header + first field alignment)
     12     4                  (alignment/padding gap)
     16-72  8             long SPSPQueueFloatingCounters4P0.p00-p07
     80     4              int SPSPQueueFloatingCounters4Fields.capacity
     84     4              int SPSPQueueFloatingCounters4Fields.mask
     88     4         Object[] SPSPQueueFloatingCounters4Fields.buffer
     92     4 VolatileLongCell SPSPQueueFloatingCounters4Fields.tail
     96     4 VolatileLongCell SPSPQueueFloatingCounters4Fields.head
    100     4         LongCell SPSPQueueFloatingCounters4Fields.tailCache
    104     4         LongCell SPSPQueueFloatingCounters4Fields.headCache
    108     4                  (alignment/padding gap)
    112-168 8             long SPSPQueueFloatingCounters4.p10-p17
    176                        (object boundary, size estimate)

psy.lob.saw.queues.spsc.fc.LongCell
 offset  size type description
      0    12      (assumed to be the object header + first field alignment)
     12     4      (alignment/padding gap)
     16-64  8 long LongCellP0.p0-p6
     72     8 long LongCellValue.value
     80-128 8 long LongCell.p10-p16
    136            (object boundary, size estimate)

psy.lob.saw.queues.spsc.fc.VolatileLongCell
 offset  size type description
      0    12      (assumed to be the object header + first field alignment)
     12     4      (alignment/padding gap)
     16-64  8 long VolatileLongCellP0.p0-p6
     72     8 long VolatileLongCellValue.value
     80-128 8 long VolatileLongCell.p10-p16
    136            (object boundary, size estimate)
If we cared more about memory consumption we could count the object header as padding and pad with integers to avoid the alignment gaps. I'm not that worried, so I won't. Note that the floating counters have to consume double the padding required for the flattened counters as they have no guarantees of their neighbours on either side. In the interest of comparing the impact of the data padding separately I also implemented a none data padding version.

Which one is more better?

While the charts produced previously are instructive and good at highlighting the variance, they make the post very bulky, so this time we'll try a table. The data with the charts is available here for those who prefer them. I've expanded the testing of JVM parameters a bit to cover the effect of the combinations of the 3 options used before, otherwise the method and setup are the same. The abbreviations stand for the following:
  • O1 - original lock-free SPSC queue with the original padding.
  • FC3 - Floating Counters, fully padded, data is not padded.
  • FC4 - Floating Counters, fully padded, data is padded.
  • I3 - Inlined Counters, fully padded, data is not padded.
  • I4 - Inlined Counters, fully padded, data is padded.
  • CCM - using the -XX:+UseConcCardMark flag
  • CT - using the -XX:CompileThreshold=100000 flag
  • NUMA - using the -XX:+UseNUMA flag
Same Core(each cell is min, mid, max, all numbers are in millions of ops/sec, bold is best result, red is best overall):
No OptsCCMCTNUMACCM+CTCCM+NUMACT+NUMACCM+CT+NUMA
O1107,108,11397,102,103111,112,11288,108,11397,103,103100,102,103111,112,112102,103,103
FC3105,108,113102,103,104111,112,113103,108,113102,103,10397,103,103111,112,113102,103,103
FC493, 96,10187, 89,9099,100,10182, 95,10089, 90,9086, 90,9081,101,10189, 90,90
I3103,123,13097, 98,100129,129,130122,129,130105,106,10797, 98,99128,129,130104,105,107
I4108,113,119103,105,106111,113,11399,118,120105,113,11498,104,114112,112,113105,114,114

Cross Core:
No OptsCCMCTNUMACCM+CTCCM+NUMACT+NUMACCM+CT+NUMA
O138, 90,13057, 84,13640, 94,10838, 59,11858, 98,11455,113,13039,53,10957,96,112
FC394,120,135109,141,15494,105,11798,115,124103,115,128116,129,14095,108,118102,120,127
FC4106,124,132118,137,152104,113,119107,119,130107,126,133114,132,150105,114,11999,123,131
I386,114,12288,113,12872, 96,11190,112,12386, 98,10885,117,12588,96,11190,99,108
I449,107,15678,133,17158,100,11248,126,155108,128,16488,143,17355,96,113104,115,164

I leave it to you to read meaning into the results, but my take aways are as follows:
  • Increasing the the CompileThreshold is not equally good for all code in all cases. In the data above it is not proving helpful of and by itself in the cross core case for any implementation. It does seem to help once CCM is thrown in as well.
  • Using ConCardMark makes a mighty difference. It hurts performance on the single core, but greatly improves the cross core case for all implementations. The difference made by CCM is covered by Dave Dice here and it goes a long way to explain the variance experienced in the inlined versions when running without it. 
  • NUMA makes little difference that I can see to the above cases. This is as expected since the code is being run on the same NUMA node throughout. Running across NUMA nodes we might see a difference.
  • As you can see there is still quite a bit of instability going on, though as an overall recommendation thus far I'd say I4 is the winner. FC4 is not that far behind when you consider the mid results to be the most representative. It also offers more stable overall results in terms of the variance.
  • 173M ops/sec! it's a high note worth musing over... But... didn't I promise you 250M? I did, you'll have to wait.
The above results also demonstrates that data inlining is a valid optimization with measurable benefits. I expect the results in more real-life scenarios to favor the inlined version even more as it offers better data locality and predictable access over the floating fields variants.
One potential advantage for the floating counters may be available should we be able to allocate the counters on their writer threads. I have not explored this option, but based on Dave Dice's observations I expect some improvement. This will make the queue quite awkward to set up, but worth a go.
There's at least one more post coming on this topic, considering the mechanics of the experiment and their implications. And after that? buggered if I know. But hang in there, it might become interesting again ;-)

UPDATE: thanks Chris Vest for pointing out I left out the link to the data, fixed now.


16 comments:

  1. looking forward to the next post =)

    ReplyDelete
    Replies
    1. It's a coming :-), getting very good results

      Delete
  2. Hi Nitsan,

    have you checked out Fast-Flow?
    http://calvados.di.unipi.it/fastflow

    I'm in the process of creating a java port (for x86 / AMD64 only at the moment) of fast flow.

    Their core structure is a bound FIFO SPSC queue dubbed a Buffer.

    When correctly cache aligned (data and control variables) I manage over 400M queue/dequeue (synthetic) operations within a second on an i7 3720 2.6 GHz (Sun JDK 1.7, default settings).
    This gets reduced to about 150M when using actual data due to memory pressure.

    ReplyDelete
    Replies
    1. I'm aware of FastFlow, but not had a look at their code. The result you quote is great, can you please share:
      1. The code
      2. The benchmark harness
      I'd be keen to run the same code within the same conditions and see what gives. The harness is particularly important as it is key that we measure the same thing under the same conditions. I'm in the process of writing a follow up to this post exploring the different results you get from tweaking the harness and the significance of such tweaks to real world behaviour.
      Also not clear what you mean by synthetic vs actual data.
      I'm very happy about your comment, one of the main principles I'm trying to promote via this blog is peer review and an open approach to implementation and results :-)

      Delete
    2. Find sample code at http://pastebin.com/46GeLzdH (GPL)

      I haven't finished the FF/j lib yet, but multi producer performance seems pretty decent via FF.

      The synth benchmark is simply pushing the same object between two threads via the FFBuffer.
      When actual data (e.g. create a new item to be pushed to queue via new operator), performance drops considerably, but assuming you want to only pass longs around, then this should give you an adequate estimate.

      Delete
    3. Having looked through the code, here's some issues I found:
      1. The writes to head/tail (or pread/pwrite as they are named in your impl) are plain writes, as such they offer no happens before relationship. This means the code is very likely (at the mercy of JIT compilation) incorrect and can lead to elements being dropped. Note in the original code(http://sourceforge.net/p/mc-fastflow/code/62/tree/ff/buffer.hpp#l26) the calls to WMB() that is write memory barrier, your code is missing that and that is why you need either a volatile write or a lazySet/putOrdered.
      2. The padding around data is not actually padding the data. In Java the data is not inlined into the top struct layout, it's an array. An array is a reference to the array object allocated outside of the instance.
      3. The index is an int. This means it will wrap pretty quickly. The original code handles this eventuality while yours doesn't. MAX_INT is not such a big number if you consider a system that needs to run for a few months without a restart.
      4. The test code re-allocates the buffer on each iteration. This gives you a false impression of the memory layout. Queues are normally long standing objects and get packed into the perm gen early in the life of a system. This impacts runtime behaviour.
      5. You are not actually doing anything with the value you read. This means that JIT is open to do some pretty unrealistic (in the sense that people are not ever likely to use the code that way) optimizations on your code.
      6. As you are not using the inheritance method to enforce field ordering you are more open to fields being reordered. See previous posts on memory layout for more information on that.
      I don't mean to be mean, or discourage you in any way. Maybe your code will still perform very well once these corrections have been implemented, but as it is I cannot properly compare it to a valid queue implementation.
      Thanks for your interest, and for pointing me at the FastFlow paper, very interesting.
      Regards,
      Nitsan

      Delete
    4. Let me try to answer

      Restriction: The queue is only designed for X84 / AMD64 systems.

      1. WMB() is a noop on X86 / AMD64 systems as it is stated in the original code, hence the limitation.
      2. I know, the padding is into the data array, which further evades false sharing
      3. Wrapping is not a problem and considered (two's complement), it's a bound circular buffer
      4. Again, not a problem, as the values in the queue are set to null on dequeue
      5. That is true: It actually behaves in the same manner as your QueuePerfTest.java
      That is also the reason I call this a synthetic test.
      As said, when I insert i (the counter) and build the sum of the values, the performance drops to about 150M
      6. I understand that, I've only considered upto OpenJDK (till 7-b127), not sure how it behaves afterwards, have to check that, but I doubt that the performance of the code will be affected.

      I understand your concern, but the buffer does behave exactly like the original C++ version.
      It does not wrap and stops accepting new items once it gets full.
      That is also the reason, why the designers of this queue call it wait free, as it does not rely on shared CASed values.


      See http://pastebin.com/YJkHZzZQ for revised version.
      This one implements the Queue interface, so you should be able to drop it into your test harness.
      I've changed the embedded test case, which now builds the sum and verifies its correctness.

      Cheers,
      Pressenna

      Delete
    5. 1. WMB is a STORESTORE barrier, which is a noop in the sense that it is a MOV like any other move, but is not the same as not having a barrier. It's a compiler instruction to stop writes from getting re-ordered. Plain writes are not a memory barrier. WMB translates to __asm__ __volatile__ ("": : :"memory"), not an empty line. This is a compiler memory barrier instruction(http://en.wikipedia.org/wiki/Memory_ordering#Compiler_memory_barrier). If you are not using a memory barrier your implementation is potentially broken.
      2. How is padding 'into the array'? you are padding around the reference to the array not actually around the array.
      3. Fair point for an implementation which does not implement size. If size was implemented running out of ints would be a problem. My bad.
      4. Your test was different because the queue was allocated on each test run, not reused across all the run iterations. In my testing this has made a difference. I see you have changed that. Now that you've implemented Queue I can slot it into the tests I have. Also note that the QueuePerfTest is measuring including the thread creation and until thread.join, and that running the publisher ahead of the client instead of the other way around is also significant to the results. QueuePerfTest was written by Martin Thompson for a presentation/demonstration, it's not perfect but to compare results you must have the same experiment going.
      5. QueuePerfTest keeps the last value dequeued and prints it so that JIT can't just ignore the values or optimize away reading them. Your test never examined the result. Now it does.
      6. If the memory layout changes you might be exposed to false sharing and then performance will change.
      The queues presented in this series of posts are all lock-free and wait free, there is not a CAS in sight :-) The original implementation I took from Martin's presentation and played with it. He has other tricks up his sleeves, I picked this algorithm as he made it open source. I'm very glad you are dragging me into another variation on this topic :-) I like it! Side stepping the size and using the null check instead is brilliant. I think that it will perform just as well with the write barrier corrected. The size function should also be straightforward enough to implement at that point. NEW TOYS!!! :-)
      Thanks alot, this is very interesting (for me anyway).

      Delete
    6. Hi, good morning.

      1. The WMB is a compiler memory barrier, which is a NOOP on Intel/AMD CPUs.
      As they are out-of-order CPUs it is irrelevant for the implementation at hand.
      2. I am accessing the array in strides, which seems only important in single queue scenarios.
      Doing this reduces false sharing and boost performance by reading aligned memory locations.
      3. a size() method only works as a hint as it's value constantly changes.
      So reducing it to empty() / not_empty() saves a lot of resources and avoid sharing "hot" states between threads.
      4. I wasn't able to measure any difference in the micro benchmarks, so not sure.
      5. Adjusted it - again - I was not able to measure any difference in local tests.
      6. That might be true


      "FastFlow (斋戒流) is a parallel programming framework for multi-core platforms based upon non-blocking lock-free/fence-free synchronization mechanisms.". FastFlow's buffer (queue) is fence free, which you pointed out in #1.

      #1 and #2 seems to cater for the primary performance of this queue implementation.
      Since FastFlow is built ontop of this essential structure, I am looking for the best possible SPSC queue out there for the Java port. I might have to use another queue implementation on a different CPU platform, but that will be hidden behind an interface.

      This is also of great interest to me and it helps me a lot discussing these matters with you.

      Delete
    7. Ok, mid way through first coffee... will try and make sense:
      1. If you have no memory barrier then you are not publishing to your queue correctly and the consumer can see half initialized state from producer. WMB is a CPU noop but also compiler instruction to not re-order code. I can't see how it would be irrelevant to any queue implementation.
      2. Access in strides also means you are getting less throughput per stride from CPU cache line fetching. If you set the POW to 4 you'll get bad throughput.
      3. Size is a feature. It's like reading any unlocked shared data. This may be a requirement for monitoring. Having a size method doesn't mean you have to use it in offer/poll.
      4. Try forcing GC between iterations. This should help shake things into place.
      5. OK, I'm doing a writeup on the differences I see in the next post, so won't go into this further.
      Fence (as in full fence instructions) free is not memory barrier free. See here for the FF WMB: http://sourceforge.net/p/mc-fastflow/code/62/tree/ff/sysdep.h
      Thanks :-)

      Delete
    8. 1. http://en.wikipedia.org/wiki/Memory_ordering
      "These barriers prevent a compiler from reordering instructions, they do not prevent reordering by CPU."
      If I understand this correctly, the CPU might reorder the write instructions thus ignoring the order.
      AFAIK, a mfence is required to prohibit the CPU from reordering, which is not used anywhere here.
      2. I know, this is a very tweaky parameter as it drops by 2 in performance when changed from 2.
      3. I guess you are right. but I think empty/not empty might be an acceptable alternative.
      4. I still don't see any difference locally
      5. waiting for it and the harness :)

      Delete
    9. 1. For Java a STORE/STORE barrier is required, which is putOrdered. This is also described in the JMM as a no-op, but the ordering inhibition from the JIT compiler is significant. I'm not sure an mfence is required, this is how WMB are handled in Java... I've added your version and a version using putOrdered to the examples on Git and to the harness (if this is an issue let me know and I'll remove your code). Have a look.
      4. Maybe I been sniffing too much glue ;-). I will re-test at some point. For reference see on GitHub QueuePerfTest/ReversedQueuePerfTest/FairQueuePerfTest for variations on the theme. I'm getting the most consistent results with the Fair option. Your implementation is 9, the ordered variation is 91.
      5. The code is all there if you want a spoiler (just test for yourself), it'll be another week at the very least. Be sure to compare the relative performance when running on the same core and across cores.
      Once putOrdered is in place I found the new method to produce very similar results to the existing inlined variants. Currently 45 is looking like the best option(SPSCQueue5).

      Delete
    10. Pressana, this discussion urged me to write a post clarifying the "no-op" term used in the JMM for these memory barriers:
      http://psy-lob-saw.blogspot.com/2013/08/memory-barriers-are-not-free.html
      Hope it helps the explanation above.

      Delete
  3. Hi Nitsan,

    Good article regarding the meaning of No-Op.
    I checked out your code and test harness.
    When I remove the Thread.yield() methods from the Producer and Consumer in QueuePerfTest (to simulate busy spinning), I end up with occasional JVM crashes in both FFBuffer, FFBufferOrdered and SPSCQueue5.
    After looking at the assemblies it appears that the C2 compiler is optimizing and rearranging instructions as it pleases regardless of using Unsafe.putOrderd* or volatile keywords.
    Not that the ASM it generates is wrong, it just appears to end up in an endless loop.
    Would you know a means to disallow such optimizations?

    Regards,
    Pressenna

    ReplyDelete
    Replies
    1. 1. How do you mean crash? core dump? hang? I can understand hanging but core dumping on FFBuffer would indicate a JVM bug as it is playing within the limits of the language(i.e. not using Unsafe/JNI)
      2. I can't reproduce the issue you describe. I'm running on a recent i7 4500MQ/Ubuntu13.04/JDK1.7u40 and none of the implementations hang/crash.
      3. The WMB doesn't stop re-ordering completely. I limits re-ordering such that other writes can't be moved from before the barrier to after the barrier. If this is the issue you see then this is a JVM bug. I'll have a dig around the assemblies but I'd be very surprised if that is the case. There's no means to disallow these optimizations beyond using memory barriers, which you report not to work for you.
      If you have a look at the latest code you'll notice there are later variations on the implementations. I'm still thinking about some of the details in terms of correctness so don't assume they are all 100% :-)
      Finally, I feel this discussion is probably better pursued by e-mail. I'm on nitsanw ._. at ._. yahoo ._. com let's talk.
      Thanks,
      Nitsan

      Delete
    2. Correction, after many repeat runs I can see it hang for some of the FFBuffer variations, but none of the SPSC variations. I suspect this is due to the fact I kept to the original assumption that only a write memory barrier is required (in some of the variations). Where there is a matching read memory barrier I don't see it hang.

      Delete