Sunday, 17 March 2013

Single Producer/Consumer lock free Queue step by step

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
Reading through a fine-tuned lock-free single producer/consumer queue, working through the improvements made and their respective impact.
Back in November, Martin Thompson posted a very nice bit of code on GitHub to accompany his Lock Free presentation. It's a single producer/consumer queue that is very fast indeed. His repository includes the gradual improvements made, which is unusual in published code, and offers us some insight into the optimization process. In this post I break down the gradual steps even further, discuss the reasoning behind each step and compare the impact on performance. My fork of his code can be found here and contains the material for this post and the upcoming post. For this discussion, focus on the P1C1QueueOriginal classes.
The benchmarks for this post were run on a dual socket Intel(R) Xeon(R) CPU E5-2630 @ 2.30GHz machine running Linux and OpenJDK 1.7.09. A number of runs were made to ensure measurement stability and a representative result from those runs was then taken. Affinity was set using taskset. Running on the same core means the threads are pinned to 2 logical cores on the same physical core. Across cores means pinned to 2 different physical cores on the same socket. Across sockets means pinned to 2 different cores on different sockets.

Starting point: Lock free, single writer principle

The initial version of the queue, P1C1QueueOriginal1, while very straightforward in its implementation, already offers us a significant performance improvement and demonstrates the important Single Writer Principle. It is worthwhile reading it and comparing offer/poll with their counterparts in ArrayBlockingQueue.
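To make the mechanics concrete, here is a minimal sketch of the kind of offer/poll pair under discussion (class and field names are mine, and this is a simplification rather than the actual P1C1QueueOriginal1 source): the producer is the only writer of tail, the consumer the only writer of head, so neither side needs a lock or a CAS.

    // Minimal SPSC queue sketch: the producer is the only writer of tail, the
    // consumer the only writer of head, so no locks or CAS loops are needed.
    public class SpscQueueSketch<E> {
        private final E[] buffer;
        private volatile long head = 0; // written only by the consumer
        private volatile long tail = 0; // written only by the producer

        @SuppressWarnings("unchecked")
        public SpscQueueSketch(int capacity) {
            buffer = (E[]) new Object[capacity];
        }

        public boolean offer(E e) {
            if (e == null) {
                throw new NullPointerException("null elements are not supported");
            }
            long currentTail = tail;
            if (currentTail - head == buffer.length) {
                return false; // queue is full
            }
            buffer[(int) (currentTail % buffer.length)] = e;
            tail = currentTail + 1; // volatile write publishes the element store above
            return true;
        }

        public E poll() {
            long currentHead = head;
            if (currentHead == tail) {
                return null; // queue is empty
            }
            int index = (int) (currentHead % buffer.length);
            E e = buffer[index];
            buffer[index] = null; // let the GC reclaim the consumed element
            head = currentHead + 1; // volatile write hands the slot back to the producer
            return e;
        }
    }

The volatile write to each counter is what publishes the stores before it, and the other thread's volatile read of that counter is what makes them visible: that is all the synchronization a single-writer design needs.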
Running the benchmark for ArrayBlockingQueue on same core/across cores/across sockets yields the following result (run the QueuePerfTest with parameter 0):
same core      - ops/sec=9,844,983
across cores   - ops/sec=5,946,312 [I got lots of variance on this one, took an average]
across sockets - ops/sec=2,031,953

We expect this degrading scale as we pay more and more for cache coherency traffic. These results are our out-of-the-box JDK baseline.
When we move on to our first cut of the P1C1 Queue we get the following result (run the QueuePerfTest with parameter 1):
same core      - ops/sec=24,180,830[2.5xABQ]
across cores   - ops/sec=10,572,447[~2xABQ]
across sockets - ops/sec=3,285,411[~1.5xABQ]

Jumping jellyfish! Off to a good start with large improvements on all fronts. At this point we have gained speed at the expense of limiting our scope from multi producer/consumer to single producer/consumer. To go further we will need to show greater love for the machine. Note that the P1C1QueueOriginal1 class is the same as Martin's OneToOneConcurrentArrayQueue.

Lazy loving: lazySet replaces volatile write

As discussed previously on this blog, single writers can get a performance boost by replacing volatile writes with lazySet. We replace the volatile long fields for head and tail with AtomicLong and use get for reads and lazySet for writes in P1C1QueueOriginal12. We now get the following result (run the QueuePerfTest with parameter 12):
same core      - ops/sec=48,879,956[2xP1C1.1]
across cores   - ops/sec=30,381,175[3xP1C1.1 large variance, average result]
across sockets - ops/sec=10,899,806[3xP1C1.1]

As you may or may not recall, lazySet is a cheap volatile write: it provides the happens-before guarantee a single writer needs without forcing a drain of the store buffer. This manifests here as lower overhead for the writing thread, as well as reduced cache coherency noise, as writes are not forced through.
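In code terms the change is small; a hedged fragment of the producer side of the earlier sketch after the substitution (again a simplification, not the exact P1C1QueueOriginal12 source):

    // Fragment of the sketch above with the counters turned into
    // java.util.concurrent.atomic.AtomicLong: reads use get(), and the single
    // writer of each counter publishes with lazySet() instead of a volatile write.
    private final AtomicLong head = new AtomicLong(0); // written only by the consumer
    private final AtomicLong tail = new AtomicLong(0); // written only by the producer

    public boolean offer(E e) {
        long currentTail = tail.get();
        if (currentTail - head.get() == buffer.length) {
            return false; // queue is full
        }
        buffer[(int) (currentTail % buffer.length)] = e;
        tail.lazySet(currentTail + 1); // ordered store: no store buffer drain forced
        return true;
    }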

Mask of power: use '& ((k pow 2) - 1)' instead of '%'

The next improvement is replacing the modulo operation used for wrapping the array index with a bitwise mask. This 'trick' is also present in ring buffer implementations, Cliff Click's CHM and ArrayDeque. Combined with the lazySet improvement, this version is Martin's OneToOneConcurrentArrayQueue2, or P1C1QueueOriginal2 in my fork.
The result (run the QueuePerfTest with parameter 2):
same core      - ops/sec=86,409,484[1.9xP1C1.12]
across cores   - ops/sec=47,262,351[1.6xP1C1.12 large variance, average result]
across sockets - ops/sec=11,731,929[1.1xP1C1.12]

We made a minor trade-off here, forcing the queue to have a size which is a power of 2, and got some excellent mileage out of it. The modulo operator is quite expensive, both in latency and in the way it limits instruction throughput, and this is a trick worth employing when the opportunity arises.
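For illustration, here is roughly how the index calculation changes in a sketch like the one above, assuming the constructor rounds the requested capacity up to the next power of 2 (field and method names are mine):

    // In the constructor: round the requested capacity up to the next power of 2,
    // e.g. capacity = 1 << (32 - Integer.numberOfLeadingZeros(requestedCapacity - 1));
    // then mask = capacity - 1.
    private final int mask; // buffer.length - 1, with buffer.length a power of 2

    private int index(long sequence) {
        // before: (int) (sequence % buffer.length)  -- integer division under the hood
        // after: a single AND, same result for power-of-2 sizes and non-negative sequences
        return (int) (sequence & mask);
    }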
So far our love for the underlying architecture is expressed by offering cheap alternatives for expensive instructions. The next step is sympathy for the cache line.
[UPDATE 3/11/2014: See further focused investigation into the merits of different modulo implementations here]

False sharing

False sharing is described elsewhere (here, and later here, and more recently here, where the upcoming solution via the @Contended annotation is discussed). To summarize, from the perspective of the CPU cache coherency system, if a thread writes to a cache line then it 'owns' it; if another thread then needs to write to the same line, ownership must be exchanged. When this happens for writes into different locations, the sharing is 'falsely' assumed by the CPU and time is wasted on the exchange. The next step of improvement is made by padding the head and tail fields so that they do not share a cache line, in P1C1QueueOriginal21.
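One common way to achieve this in Java (before @Contended) is to pad the hot value with otherwise unused fields; the sketch below shows the idea, though it is not necessarily the exact padding scheme used in P1C1QueueOriginal21:

    // Best-effort padding (field layout is ultimately up to the JVM): the 7 extra
    // longs make each counter object span roughly a 64-byte cache line of its own,
    // so the two counters' values do not end up side by side on one line.
    class PaddedAtomicLong extends AtomicLong {
        public volatile long p1, p2, p3, p4, p5, p6, p7;
        PaddedAtomicLong(long initialValue) { super(initialValue); }
    }

    // in the queue, head and tail no longer share a cache line:
    private final PaddedAtomicLong head = new PaddedAtomicLong(0);
    private final PaddedAtomicLong tail = new PaddedAtomicLong(0);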
The result (run the QueuePerfTest with parameter 21):
same core      - ops/sec=88,709,910[1.02xP1C1.2]
across cores   - ops/sec=52,425,315[1.1xP1C1.2]
across sockets - ops/sec=13,025,529[1.2xP1C1.2]

This made less of an impact than previous changes. False sharing is a less deterministic side effect and may manifest differently depending on the luck of the draw in memory placement. We expect code which avoids false sharing to show less variance in performance. To see the variation, run the experiment repeatedly; each run will result in a different memory layout of the allocated queue.

Reducing volatile reads

A common myth regarding volatile reads is that they are free; the next improvement step shows that to be false. In P1C1QueueOriginal22 I have reversed the padding of the AtomicLongs (i.e. head and tail are back to being plain AtomicLong) and added caching fields for the last read value of head and tail. As these values are only used from a single thread (tailCache is used by the consumer, headCache by the producer) they need not be volatile. Their only purpose is to reduce volatile reads to a minimum. Normal reads, unlike volatile reads, are open to greater optimization and may end up in a register; volatile reads are never served from a register (i.e. always from memory).
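A hedged fragment of how the producer side might use its cache field (mirrored on the consumer side with tailCache): the volatile read of head is only paid for when the cached value suggests the queue could be full. Names follow the text above; the code is a sketch, not the exact P1C1QueueOriginal22 source.

    private long headCache = 0; // read and written only by the producer thread

    public boolean offer(E e) {
        long currentTail = tail.get();
        long wrapPoint = currentTail - buffer.length;
        if (headCache <= wrapPoint) {   // looks full according to the cached head...
            headCache = head.get();     // ...so pay for one volatile read to refresh it
            if (headCache <= wrapPoint) {
                return false;           // genuinely full
            }
        }
        buffer[(int) (currentTail & mask)] = e;
        tail.lazySet(currentTail + 1);
        return true;
    }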
The result (run the QueuePerfTest with parameter 22):
same core      - ops/sec=99,181,930[1.13xP1C1.2]
across cores   - ops/sec=80,288,491[1.6xP1C1.2]
across sockets - ops/sec=17,113,789[1.5xP1C1.2]

By Toutatis!!! This one is a cracker of an improvement. Not having to load the head/tail value from memory as volatile reads makes a massive difference.
The last improvement made is adding the same false sharing guard we had for the head and tail fields around the cache fields. This is required as these are both written at some point and can still cause false sharing, something we all tend to forget can happen to plain fields/data even when they have nothing to do with concurrency. I've added a further implementation, P1C1QueueOriginal23, where only the cache fields are padded and not the head and tail. It makes for a slight further improvement, but as the head and tail still suffer from false sharing it is not a massive step forward.

UPDATE (21/09/2013): As Martin argues in the comments below, the justification above for the performance gained by adding the cache fields is incorrect. The improvement is caused mostly by the reduction in cache coherency traffic. For an expanded explanation see this later post here. Volatile reads are by no means free, and some of the improvement is due to the reduction in reads, but it is not the main reason for the improvement.

All together now!

The final version P1C1QueueOriginal3 packs together all the improvements made before:
  • Lock free, single writer principle observed. [Trade off: single producer/consumer]
  • Set capacity to a power of 2, use mask instead of modulo. [Trade off: more space than intended]
  • Use lazySet instead of volatile write.
  • Minimize volatile reads by adding local cache fields. [Trade off: minor size increment]
  • Pad all the hot fields: head, tail, headCache, tailCache. [Trade off: minor size increment]
The result (run the QueuePerfTest with parameter 3):
same core      - ops/sec=110,405,940[1.33xP1C1.2]
across cores   - ops/sec=130,982,020[2.6xP1C1.2]
across sockets - ops/sec=19,338,354[1.7xP1C1.2]

To put these results in the context of the ArrayBlockingQueue baseline:
same core      - ops/sec=110,405,940[11xABQ]
across cores   - ops/sec=130,982,020[26xABQ]
across sockets - ops/sec=19,338,354[9xABQ]

This is a great improvement indeed, hats off to Mr. Thompson.

Summary, and the road ahead

My intent in this post was to give context and add meaning to the different performance optimizations used in the queue implementation. In that, I am just elaborating further on Martin's work. If you find the above interesting I recommend you attend one of his courses or talks (if you can find a place).
During the course of running these experiments I encountered great variance between runs, particularly in the case of running across cores. I chose not to explore that aspect in this post and picked representative measurements for the sake of demonstration. To put it another way: your mileage may vary.
Finally, my interest in the queue implementation was largely as a data structure I could port off heap to be used as an IPC messaging mechanism. I have done that and the results are in my fork of Martin's code here. This post evolved out of the introduction to the post about my IPC queue; it grew to the point where I decided they would be better apart, so here we are. The IPC queue is working and achieves similar throughput between processes as demonstrated above... coming soon.
UPDATE: IPC post published here hitting 135M between processes
UPDATE: Run to run variance further explored and eliminated here

Wednesday, 6 March 2013

Merging Queue, ArrayDeque, and Suzie Q

A merging queue is a map/queue data structure which is useful for handling data where only the latest update/value for a particular topic is required, but you still wish to process the data in FIFO manner. Two flavours are implemented, benchmarked and discussed.
The merging queue is a useful construct for slow consumers. It allows a bounded queue to keep receiving updates, with the space requirement limited by the number of distinct keys. It also allows the consumer to skip old data. This is of particular interest for systems dealing with fast moving data where old data is not relevant and will just slow you down. I've seen this requirement in many pricing systems in the past few years, but there are other variations.
Here's the interface:
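A sketch of roughly what such an interface might look like (method names are my guess, not necessarily those in the linked code):

    // A merging queue: offer(key, value) replaces any pending value for the key while
    // keeping the key's position in FIFO order; poll() returns the latest value for
    // the oldest pending key, or null when nothing is pending.
    public interface MergingQueue<K, V> {
        void offer(K key, V value);
        V poll();
        boolean isEmpty();
        int size();
    }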

What about LinkedHashMap?

Now it is true that LinkedHashMap offers similar functionality and you could use it to implement a merging queue as follows:
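A sketch of roughly what that might look like (class name is mine, not necessarily the code in the repository):

    import java.util.LinkedHashMap;

    // Insertion-ordered LinkedHashMap: re-offering an existing key updates its value
    // without moving it, which is exactly the merging-queue semantic we want.
    public class LinkedMergingQueue<K, V> implements MergingQueue<K, V> {
        private final LinkedHashMap<K, V> lastValMap = new LinkedHashMap<>();

        @Override
        public void offer(K key, V value) {
            lastValMap.put(key, value);
        }

        @Override
        public V poll() {
            if (lastValMap.isEmpty()) {
                return null;
            }
            // the clumsy bit: spin up an iterator just to find the eldest key, then remove it
            return lastValMap.remove(lastValMap.keySet().iterator().next());
        }

        @Override
        public boolean isEmpty() { return lastValMap.isEmpty(); }

        @Override
        public int size() { return lastValMap.size(); }
    }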
This works, but the way we have to implement poll() is clumsy. What I mean by that is that we end up doing a lot more work than we should in order to get around some missing functionality. If you dig into the machinery behind the expression "lastValMap.remove(lastValMap.keySet().iterator().next())" there is an awful lot of intermediate structure to jump through before we get where we are going. LinkedHashMap is simply not geared toward being a queue; we are abusing it to get what we want.

ArrayDeque to the rescue!

ArrayDeque is one of the unsung heroes of the Java collections. If you ever need a non-concurrent queue or stack, look no further than this class. In its guts you'll find the familiar ring buffer. It doesn't allocate or copy anything when you pop elements out or put them in (unless you exceed the capacity). It's cache friendly (unlike a linked list). It's LOVELY!
Here's a merging queue implementation using a HashMap and ArrayDeque combined:
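A sketch of the shape of it (not necessarily identical to the code in the repository):

    import java.util.ArrayDeque;
    import java.util.HashMap;

    public class ArrayDequeMergingQueue<K, V> implements MergingQueue<K, V> {
        private final HashMap<K, V> lastValMap = new HashMap<>();
        private final ArrayDeque<K> keyQueue = new ArrayDeque<>();

        @Override
        public void offer(K key, V value) {
            // a null previous mapping means the key has no pending value, so it
            // (re)joins the queue; otherwise only the value is replaced in the map
            if (lastValMap.put(key, value) == null) {
                keyQueue.offer(key);
            }
        }

        @Override
        public V poll() {
            K key = keyQueue.poll();
            if (key == null) {
                return null;
            }
            // set the entry to null rather than remove it: saves map churn/GC at the
            // cost of keeping an entry per key ever seen (fine for a bounded key set)
            return lastValMap.put(key, null);
        }

        @Override
        public boolean isEmpty() { return keyQueue.isEmpty(); }

        @Override
        public int size() { return keyQueue.size(); }
    }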
You can replace the HashMap with an open address implementation to get more cache friendly behaviour for key collisions if you like, but in the name of KISS we won't go down that particular rabbit hole. As the comment states, setting entries to null rather than removing them is an optimization with a trade off. If your key set is not of a finite manageable range then this is perhaps not the way to go. As it stands it saves you some GC overheads. This optimization is not really open to you with LinkedHashMap where the values and their order are managed as one.
ArrayDeque is a better performer than any other queue for all the reasons discussed in this StackOverflow discussion, which boil down to:

  • backed by a ring buffer (yes, like the Disruptor! you clever monkeys) 
  • it uses a power-of-2 sized backing array, which allows it to replace modulo (%) with a bitwise AND (&); this works because x % some-power-of-2 is the same as x & (some-power-of-2 - 1)
  • adding and removing elements is all about changing the head/tail counters, no copies, no garbage (until you hit capacity).
  • iterating through an array involves no pointer chasing, unlike a linked list.


I like the way you walk, I like the way you talk, Susie Q!

I'm using a micro-benchmarking framework which is both awesome and secret, so sadly the benchmark code is not entirely peer-reviewable. I will put the benchmark on GitHub when the framework makers give me the go-ahead, which should be soon enough. Here are the benchmarks:

The results (average over multiple runs) are as follows:
Experiment                            Throughput           Cost
array.measureOffer,                   100881.077 ops/msec, 10ns
array.measureOffer1Poll1,              41679.299 ops/msec, 24ns
array.measureOffer2Poll1,              30217.424 ops/msec, 33ns
array.measureOffer2Poll2,              21365.283 ops/msec, 47ns
array.measureOffer1000PollUntilEmpty,    102.232 ops/msec, 9804ns
linked.measureOffer,                  103403.692 ops/msec, 10ns
linked.measureOffer1Poll1,             24970.200 ops/msec, 40ns
linked.measureOffer2Poll1,             16228.638 ops/msec, 62ns
linked.measureOffer2Poll2,             12874.235 ops/msec, 78ns
linked.measureOffer1000PollUntilEmpty,    92.328 ops/msec, 10830ns
--------

Interpretation:

  • The offer method cost for both implementations is quite similar at 10ns, with the linked implementation perhaps marginally faster.
  • The poll method cost is roughly 14ns for the ArrayDeque-based implementation, and 30ns for the linked implementation. Further profiling has also shown that while the deque implementation generates no garbage, the linked implementation has some garbage overhead.
  • For my idea of a real world load, the ArrayDeque-based implementation is 10% faster.
Depending on the ratio between offer and poll, the above implementation can be quite attractive. Consider for instance that queues tend to be either nearly empty, or quite full when a burst of traffic comes in. When you are dealing with relatively little traffic the cost of polling is more significant; when you are merging a large buildup of updates into your queue the offer cost is more important. Luckily this is not a difficult choice, as the ArrayDeque-based implementation is only marginally slower for offering and much faster for polling.
Finally, a small real world gem I hit while writing this post. When benchmarking the 1k offer/queue drain case for the linked implementation I hit this JVM bug - "command line length affects performance". The way it manifested was bad performance (~50 ops/ms) when running with one set of parameters, and much better performance when using some extra parameters to profile GC, which I'd have expected to slow things down if anything. It had me banging my head against the wall for a bit; I wrote a second benchmark to validate what I considered the representative performance, and so on. Eventually I talked to Mr. Shipilev, who pointed me at this ticket. I was not suffering the same issue with the other benchmarks, or with the same benchmark for the other implementation, which goes to show what a slippery sucker this is. The life lesson from this is to only trust what you measure. I can discard the benchmark result if I like, but if you change your command line arguments in a production environment and hit a kink like that you will have a real problem.
Many thanks to Doug Lawrie, with whom I had a discussion about his implementation of a merging event queue (a merging queue stuck on the end of a Disruptor), which drove me to write this post.

Update 08/03/2013: Just realized I forgot to include a link to the code. Here it is.