Sunday, 17 March 2013

Single Producer/Consumer lock free Queue step by step

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
Reading through a fine tuned lock free single producer/consumer queue. Working through the improvements made and their respective impact.
Back in November, Martin Thompson posted a very nice bit of code on GitHub to accompany his Lock Free presentation. It's a single producer/consumer queue that is very fast indeed.  His repository includes the gradual improvements made, which is unusual in software, and offers us some insights into the optimization process. In this post I break down the gradual steps even further, discuss the reasoning behind each step and compare the impact on performance. My fork of his code is to be found here and contains the material for this post and the up and coming post. For this discussion focus on the P1C1QueueOriginal classes.
The benchmarks for this post were run on a dual socket Intel(R) Xeon(R) CPU E5-2630 @ 2.30GHz machine running Linux and OpenJdk 1.7.09. A number of runs were made to unsure measurement stability and then a represantative result from those runs was taken. Affinity was set using taskset. Running on same core is pinned to 2 logical cores on the same physical core. Across cores means pinned to 2 different physical cores on the same socket. Across sockets means pinned to 2 different cores on different sockets.

Starting point: Lock free, single writer principle

The initial version of the queue P1C1QueueOriginal1, while very straight forward in it's implementation already offers us a significant performance improvement and demonstrates the important Single Writer Principle. It is worth while reading and comparing offer/poll with their counter parts in ArrayBlockingQueue.
Running the benchmark for ArrayBlockingQueue on same core/across cores/across sockets yields the following result (run the QueuePerfTest with parameter 0):
same core      - ops/sec=9,844,983
across cores   - ops/sec=5,946,312 [I got lots of variance on this on, took an average]
across sockets - ops/sec=2,031,953

We expect this degrading scale as we pay more and more for cache traffic. These result are our out of the box JDK baseline.
When we move on to our first cut of the P1C1 Queue we get the following result (run the QueuePerfTest with parameter 1):
same core      - ops/sec=24,180,830[2.5xABQ]
across cores   - ops/sec=10,572,447[~2xABQ]
across sockets - ops/sec=3,285,411[~1.5xABQ]

Jumping jelly fish! Off to a good start with large improvements on all fronts. At this point we have gained speed at the expense of limiting our scope from multi producer/consumer to single producer/consumer. To go further we will need to show greater love for the machine. Note that the P1C1QueueOriginal1 class is the same as Martin's OneToOneConcurrentArrayQueue.

Lazy loving: lazySet replaces volatile write

As discussed previously on this blog, single writers can get a performance boost by replacing volatile writes with lazySet. We replace the volatile long fields for head and tail with AtomicLong and use get for reads and lazySet for writes in P1C1QueueOriginal12. We now get the following result (run the QueuePerfTest with parameter 12):
same core      - ops/sec=48,879,956[2xP1C1.1]
across cores   - ops/sec=30,381,175[3xP1C1.1 large variance, average result]
across sockets - ops/sec=10,899,806[3xP1C1.1]

As you may or may not recall, lazySet is a cheap volatile write such that it provides happens-before guarantees for single writers without forcing a drain of the store buffer. This manifests in this case as lower overhead to the thread writing, as well as reduced cache coherency noise as writes are not forced through.

Mask of power: use '& (k pow 2) - 1 instead of %

The next improvement is replacing the modulo operation for wrapping the array index location with a bitwise mask. This 'trick' is also present in ring buffer implementations, Cliff Click CHM and ArrayDeque. Combined with the lazySet improvement this version is Martin's OneToOneConcurrentArrayQueue2 or P1C1QueueOriginal2 in my fork.
The result (run the QueuePerfTest with parameter 2):
same core      - ops/sec=86,409,484[1.9xP1C1.12]
across cores   - ops/sec=47,262,351[1.6xP1C1.12 large variance, average result]
across sockets - ops/sec=11,731,929[1.1xP1C1.12]

We made a minor trade off here, forcing the queue to have a size which is a power of 2 and we got some excellent mileage out of it. The modulo operator is quite expensive both in terms of cost and in terms of limiting instruction throughput and it is a trick worth employing when the opportunity rises.
So far our love for the underlying architecture is expressed by offering cheap alternatives for expensive instructions. The next step is sympathy for the cache line.
[UPDATE 3/11/2014: See further focused investigation into the merits of different modulo implementations here]

False sharing

False sharing is described elsewhere(here, and later here, and more recently here where the coming of a solution by @Contended annotation is discussed). To summarize, from the CPU cache coherency  system perspective if a thread writes to a cache line then it 'owns' it, if another thread then needs to write to the same line they need to exchange ownership. When this happens for writes into different locations the sharing is 'falsely' assumed by the CPU and time is wasted on the exchange. The next step of improvement is made by padding the head and tail fields such that they are not on the same cache line in P1C1QueueOriginal21.
The result (run the QueuePerfTest with parameter 21):
same core      - ops/sec=88,709,910[1.02xP1C1.2]
across cores   - ops/sec=52,425,315[1.1xP1C1.2]
across sockets - ops/sec=13,025,529[1.2xP1C1.2]

This made less of an impact then previous changes. False sharing is a less deterministic side effect and may manifest differently based on luck of the draw memory placement. We expect code which avoids false sharing to have less variance in performance. To see the variation run the experiment repeatedly, this will result in different memory layout of the allocated queue.

Reducing volatile reads

Common myth regarding volatile reads is that they are free, the next improvement step shows that to be false. In P1C1QueueOriginal22 I have reversed the padding of the AtomicLong (i.e head and tail are back to being plain AtomicLong) and added caching fields for the last read value of head and tail. As these values are only used from a single thread (tailCache is used by consumer, headCache used by producer) they need not be volatile. Their only use is to reduce volatile reads to a minimum. Normal reads, unlike volatile reads are open to greater optimization and may end up in a register, volatile reads are never from a register (i.e always from memory).
The result (run the QueuePerfTest with parameter 22):
same core      - ops/sec=99,181,930[1.13xP1C1.2]
across cores   - ops/sec=80,288,491[1.6xP1C1.2]
across sockets - ops/sec=17,113,789[1.5xP1C1.2]

By Toutatis!!! This one is a cracker of an improvement. Not having to load the head/tail value from memory as volatile reads makes a massive difference.
The last improvement made is adding the same false sharing guard we had for the head and tail fields around the cache fields. This is required as these are both written at some point and can still cause false sharing, something we all tend to forget can happen to normal fields/data even if it is nothing to do with concurrency. I've added a further implementation P1C1QueueOriginal23 where only the cache fields are padded and not the head and tail. It makes for a slight further improvement, but as the head and tail are still suffering from false sharing it is not a massive step forward.

UPDATE(21/09/2013): As Martin argues in the comments below the above justification on the source of improvement to performance gained by adding the cache fields is incorrect. The performance improvement is caused mostly by the reduction in cache coherency traffic. For an expanded explanation see this later post here. Volatile reads are by no means free, and some of the improvement is due to the reduction in reads, but it is not the main reason for the improvement.

All together now!

The final version P1C1QueueOriginal3 packs together all the improvements made before:
  • lock free, single writer principle observed. [Trade off: single producer/consumer]
  • Set capacity to power of 2, use mask instead of modulo. [Trade off: more space than intended]
  • Use lazySet instead of volatile write.
  • Minimize volatile reads by adding local cache fields. [Trade off: minor size increment]
  • Pad all the hot fields: head, tail, headCache,tailCache [Trade off: minor size increment]
The result (run the QueuePerfTest with parameter 3):
same core      - ops/sec=110,405,940[1.33xP1C1.2]
across cores   - ops/sec=130,982,020[2.6xP1C1.2]
across sockets - ops/sec=19,338,354[1.7xP1C1.2]

To put these results in the context of the ArrayBlocking queue:
same core      - ops/sec=110,405,940[11xABQ]
across cores   - ops/sec=130,982,020[26xABQ]
across sockets - ops/sec=19,338,354[9xABQ]

This is great improvement indeed, hat off to Mr. Thompson.

Summary, and the road ahead

My intent in this post was to give context and add meaning to the different performance optimizations used in the queue implementation. At that I am just elaborating Martin's work further. If you find the above interesting I recommend you attend one of his courses or talks (if you can find a place).
During the course of running these experiments I encountered great variance between runs, particularly in the case of running across cores. I chose not to explore that aspect in this post and picked representative measurements for the sake of demonstration. To put it another way: your mileage may vary.
Finally, my interest in the queue implementation was largely as a data structure I could port off heap to be used as an IPC messaging mechanism. I have done that and the result are in my fork of Martin's code here. This post evolved out of the introduction to the post about my IPC queue, it grew to the point where I decided they would be better apart, so here we are. The IPC queue is working and achieves similar throughput between processes as demonstrated above... coming soon.
UPDATE: IPC post published here hitting 135M between processes
UPDATE: Run to run variance further explored and eliminated here

27 comments:

  1. Thanks for the detailed walk-through.

    One error I found in the text - in the section at the end labeled "To put these results in the context of the ArrayBlocking queue:" you repeat the values from the section immediately preceding it. I think you want the values from the top of the article, correct?

    Cheers - P

    ReplyDelete
    Replies
    1. The text is correct, if somewhat unclear...
      The result (run the QueuePerfTest with parameter 3): <-- Comparing Final cut to step 2
      same core - ops/sec=110,405,940[1.33xP1C1.2] <-- Final cut is 33% faster than step 2
      ...
      To put these results in the context of the ArrayBlocking queue: <-- Comparing Final cut to ArrayBlockingQueue
      same core - ops/sec=110,405,940[11xABQ] <-- Final cut is 1100% faster than ABQ
      ...
      Sorry, this seems to confuse a couple of people by now, blame my poor phrasing for not getting the message through.

      Delete
  2. Hello Nitsan!

    Nice post!

    Have you tried to run benchmarks with -XX:+UseNUMA JVM option for across sockets test?

    Also it is interesting how it would behave when don't allow for head index be to close to tail index (to minimize false sharing while accessing an array data)...

    ReplyDelete
    Replies
    1. Thanks :)
      I've not tried using the NUMA flag for the cross socket test, there's allot to consider in this case and I didn't want to make this article any longer/more involved. You are correct in assuming a relationship, but it is prehaps more complicated than you realize. Memory is typically allocated on the NUMA node first accessing the memory and not moved after. Ideally we would want the producer to own this slice of memory (see Dave Dice blog on the topic: https://blogs.oracle.com/dave/entry/numa_aware_placement_of_communication1). My main objective in covering the different affinity settings was to highlight there are large differences in behavior to be considered and as most people don't set the affinity of their threads they must assume a less favourable end position is likely.
      As for your second comment, yes you could do that and risk a minor memory leak. The false sharing is caused by the consumer nulling out the references to consumed elements. It is a trade off worth considering.

      Delete
  3. False Sharing effects are greatly amplified for sequentially consistent writes, aka Java volatile writes. Using lazySet() in this case weakens the memory model semantics and thus false sharing does not hurt as much.

    ReplyDelete
    Replies
    1. You are right, I completely looked over this subtlety. I sort of hint at it in the lazySet section but it is not at all clear. Apologies.

      Delete
  4. The "volatile read" is not so much the issue. The real issues come from the read of the head or tail from the opposite end will pretty much always result in a cache miss. This is because the other side is constantly changing it.

    ReplyDelete
    Replies
    1. True, but I expect the fact the head/tail cache fields are non-volatile leads to the JIT and CPU assuming single threaded access and allowing more aggressive optimization to the effect of hardly checking if the value is changed. I struggle to think of an experiment that would prove this the isolated impact off hand, did you have one in mind?

      Delete
    2. Optimisations can be checked by looking at the assembler produced. A volatile read on x86 is just a simple MOV from a memory address to register.

      The L2 cache misses are the big give away in this case.

      Delete
    3. Exactly, a volatile is a MOV from memory to register, where as the none volatile read may well get cached in a register and read from there. An experiment to demonstrate (I will carry out when I have some time):
      Given 2 adjacent locations in memory modify one and read the other from 2 different threads. Compare combinations of volatile/lazy/plain read/write.
      I expect the volatile read to be less performant than the plain alternative, but it won't be the first time I've been wrong ;)

      Delete
    4. With a real application and reasonable register pressure I expect them to be the same :-)

      Delete
    5. Unfair use of real world as basis for argument!!! :-) you're probably right, but I bet it would make a difference in a benchmark.

      Delete
    6. In a benchmark I'd expect to see a benefit from the processor applying "store forwarding" on lazySet on the same core as opposed to a SC write (volatile) if it is read again (volatile or not). The optimisation I'd expect for a normal read would be to hoist the field read out of a loop and register allocate it.

      Delete
  5. Great post Nitsan! Two suggestions:

    1. Have you tried nulling out the read values on the producer thread instead? That way you don't violate the single writer principle.

    2. Have you tried increasing the compilation threshold in the across-cores test to say -XX:CompileThreshold=100000? I've found that sometimes different thread timings cause different JVM optimisations (and hence a large variance in performance) so it helps to profile for a bit longer.

    ReplyDelete
    Replies
    1. Thanks, standing on the shoulders of giants and so on.
      1. I was presenting Martin's work and no further for this post. If I have time to explore nulling in the writer I will, but I'm afraid you run the risk of messing up the locality of the data when jumping from one end to the next. The nice thing about a more Disruptor-ish implementation is that it is not required. My next step from Martin's queue was in the direction of inlining the data being transfered, so no nulling was required.
      2. I will give the higher compilation option a go, thanks.

      Delete
  6. This comment has been removed by the author.

    ReplyDelete
  7. Many thanks for the great post!

    When I downloaded the code and ran the test of P1C1QueueOriginal3, I encounted something that I found very strange - when I accidentally added another command-line argument, the performance dropped dramatically!

    I constistently get 168 million ops/sec when running:
    java -cp bin uk.co.real_logic.queues.QueuePerfTest 3

    But if I add another argument, like
    java -cp bin uk.co.real_logic.queues.QueuePerfTest 3 4343

    then ops/sec drops to 52 million (consistently).

    I have tried this probably hundreds of times now, and it happens every time. Isn't this extremely weird? The second argument is never used in uk.co.real_logic.queues.QueuePerfTest. Any idea what could be happening here?

    ReplyDelete
    Replies
    1. Glad you liked it, there are a couple more in this series and at least one more to come, keep reading :)
      The length of the command line shouldn't effect performance, but on occasion does. This could be down to this bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7196911
      Alternatively the extra parameter allocation is skewing the other objects just right and triggering some false sharing.
      These things happen... :(

      Delete
  8. Many thanks for the great post, again !
    Especially thanks for the Mask of power.

    In my application, I received a 26 million operations per second.
    One operation - it is 100 bytes + 4 on the headline (> 670 int per second).
    Use MappedByteBuffer and ring buffer.

    CPU i7-2600 CPU@3.40GHz

    ReplyDelete
  9. Hi Nitsan.

    Really love your blog. It's amazing to explain low-level Java programming.

    But for P1C1QueueOriginal1.java, what makes P1C1QueueOriginal1 thread safe? Since there's just one producer one customer, why just use ArrayList instead?

    ReplyDelete
    Replies
    1. P1C1QueueOriginal1.java has a volatile head/tail fields as opposed to ArrayList (a better fit for a single threaded queue is ArrayDequeue). The write/reads to/from the volatile fields serve as memory barriers which induce the thread safety (for a single producer/consumer usecase).
      Glad you like it :-)

      Delete
  10. Hi Nitsan

    Would it be possible to post your gc settings you used to get the results for P1C1QueueOriginal3?

    Cheers

    ReplyDelete
    Replies
    1. I missed your comment, sorry about that. I think it was only '-XX:+UseCondCardMark'

      Delete
  11. hi nitsan
    in the poll method, the consumer thread , does a volatile read, (ie head.get() ) why is a volatile read needed here, since the consumer thread is the only writer of head variable

    ReplyDelete
    Replies
    1. You are right, it's not needed. It's just that AtomicLong does not offer us a non-volatile load alternative and this post does not go over to the dark side (i.e. using Unsafe). In later posts this is explored to a certain extent.

      Delete
  12. Why should volatile read always be from memory if the underlying architecture offers cache coherency?

    What happens if the cpu has the cache line in its cache and doesn't have any invalidate cache requests in its queue. If there was a write barrier placed before writing to memory, doesn't the read guarantee to be the latest copy?

    ReplyDelete
    Replies
    1. Read from memory -> from the memory subsystem -> closest cache and so on (subject to cache coherency etc). The x86 assembly would be MOV REG, [ADDRESS] (as opposed to allowing the compiler to somehow arrive at a MOV REG,REG interpretation of the read).
      "If there was a write barrier placed before writing to memory, doesn't the read guarantee to be the latest copy?" - please provide some context to your question, I'm not sure what you refer to... please keep in mind that you may have read this today, but I wrote this 2 years ago... a concrete example would be best. The write-barrier you refer to also needs defining: StoreStore or StoreLoad?

      Delete

Note: only a member of this blog may post a comment.