Friday 13 December 2013

JAQ: Using JMH to Benchmark SPSC Queues Latency - Part I

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
{If you come here looking for JMH related content start at the new and improved JMH Resources Page and branch out from there!}
As part of the whole JAQ venture I decided to use the excellent JMH (Java Microbenchmark Harness; see the project here, an introduction to JMH here, and multi-threaded benchmarking here). The reason I chose JMH is that I believe hand-rolled benchmarks limit our ability to peer review, and picking a solid framework allows me and collaborators to focus on the experiment rather than the validity of the harness. One of the deciding factors for me was JMH's focus on multi-threaded benchmarks, which is something I've not seen done by other frameworks and which poses its own set of problems to the aspiring benchmarker. It's not perfect, but it's the closest we have.
Anyways... that's the choice for now, let's jump into some benchmarks!

Using the JMH Control

Benchmarking the latency of 2 threads communicating via a queue requires the JMH Control structure. This is demonstrated in the JMH samples measuring thread to thread "ping pong" via AtomicBoolean.compareAndSet (see original here):

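In sketch form, the ping pong benchmark looks roughly like this (annotation and package names below follow current JMH; the JMH version of the time spelt some of them differently, e.g. @GenerateMicroBenchmark rather than @Benchmark):

import java.util.concurrent.atomic.AtomicBoolean;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Group;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Control;

@State(Scope.Group)
public class BaselinePingPong {
    private final AtomicBoolean flag = new AtomicBoolean();

    @Benchmark
    @Group("pingpong")
    public void ping(Control ctl) {
        // flip the flag from false to true, retrying until the other thread flips it back,
        // but bail out once the iteration is over so we don't spin forever
        while (!ctl.stopMeasurement && !flag.compareAndSet(false, true)) {
            // retry
        }
    }

    @Benchmark
    @Group("pingpong")
    public void pong(Control ctl) {
        // the mirror image: flip the flag from true back to false
        while (!ctl.stopMeasurement && !flag.compareAndSet(true, false)) {
            // retry
        }
    }
}
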
The challenge here is that the CAS operation may fail, and when it does we want to retry until we succeed. BUT if the opposite thread were stopped/suspended because the benchmark iteration is done, we would be stuck in an infinite loop. To avoid this issue a benchmarked method can have a Control parameter passed in, which exposes a flag indicating the end of a run; this flag can be used to detect benchmark termination and thus avoid the above issue. You'll notice other JMH annotations are in use above:
  • State: sets the scope of the benchmark state, in this case the Group
  • Group: multi-threaded benchmarks can utilize thread groups of different sizes to exercise concurrent data structures with different operations. In the case above we see one group called pingpong. Both methods are exercised by threads from the group, using the default of 1 thread per operation.
  • Note that if we were to run the above benchmark with more threads, we would be running several independent groups. Running with 10 threads will result in 5 concurrent benchmark runs, each with its own group of 2 threads.
I pretty much copied this benchmark into the jaq-benchmarks and refer to it as a baseline for inter-thread message exchange latency. The results for this benchmark are as follows (running on the same core):
Benchmark                                  Mode Thr     Count  Sec         Mean   Mean error    Units
i.j.s.l.BaselinePingPong.pingpong          avgt   2        10    1       23.578        0.198    ns/op
i.j.s.l.BaselinePingPong.pingpong:ping     avgt   2        10    1       23.578        0.198    ns/op
i.j.s.l.BaselinePingPong.pingpong:pong     avgt   2        10    1       23.578        0.198    ns/op

And running across different cores:
Benchmark                                  Mode Thr     Count  Sec         Mean   Mean error    Units
i.j.s.l.BaselinePingPong.pingpong          avgt   2        10    1       60.508        0.217    ns/op
i.j.s.l.BaselinePingPong.pingpong:ping     avgt   2        10    1       60.496        0.195    ns/op
i.j.s.l.BaselinePingPong.pingpong:pong     avgt   2        10    1       60.520        0.248    ns/op


The way I read it, a CAS is sort of a round trip to memory: reading and comparing the current value, and setting it to a new value. You could argue that it's only a one way trip, and I can see merit in that argument as well. The point is, these are the ballpark figures. We shouldn't be able to beat this score, but if we manage to get close we are doing well.
Note from a hardware point of view: See this answer on Intel forums on cache latencies. Latencies stand at roughly:
"L1 CACHE hit, ~4 cycles
 L2 CACHE hit, ~10 cycles
 L3 CACHE hit, line unshared ~40 cycles
 L3 CACHE hit, shared line in another core ~65 cycles
 L3 CACHE hit, modified in another core ~75 cycles
 remote L3 CACHE ~100-300 cycles
 Local Dram ~60 ns
 Remote Dram ~100 ns"
To convert cycles to time you need to divide by your CPU clock speed (e.g. at 2.4GHz a cycle is roughly 0.42ns, so a hit on a line modified in another core at ~75 cycles works out to roughly 31ns). How the delay manifests will also depend on whether or not you require the data to make progress, etc. The thing to keep in mind here is that there is a physical limit to the achievable latency. If you think you're going faster... check again ;-).

Queue round trip latency benchmark

Queues are not about exchanging a single message, they are about exchanging many messages asynchronously (at least that's what the queues I care about are about). Exchanging a single message is a worst case scenario for a queue, there's no gain to be had by batching, no false sharing to avoid, nothing much to cache, very little cost to amortize. It is therefore a valuable edge case, and an important one for many real world applications where latency needs to be minimized for bursty traffic/load. To put my queues to the test I devised a round trip benchmark which allows us to chain a configurable number of threads into a ring and measure the round trip latency. This should help magnify the exchange costs and perhaps expose behaviours not so easily detected in the minimal case. Here's an outline of the functionality under test given a chain length N (which requires N queues and N threads including the measuring thread) and a burst of K messages:
  1. From thread[0] send K objects down queue[0].
  2. Concurrently: in each thread i (0 < i < N) read from queue[i-1] and write to queue[i].
  3. Thread[0] waits for K objects to be polled from queue[N-1] and returns.
There were several challenges/intricacies to be considered here:
  • The number of benchmarking threads is unknown at runtime (in JMH), so chain length must be defined separately.
  • Setup/Teardown methods on the Iteration level are called per thread.
  • Thread state is allocated and used per thread per iteration. The threads are pooled, so a given thread can switch 'roles' on each iteration. In other words @State(Scope.Thread) is not ThreadLocal.
  • Concurrent polls to SPSC queues can leave the queue in an undefined state.
  • The benchmark requires that the queues are empty when we start measuring, so at the beginning or end of an iteration we must clear the queues.
  • Queues must be cleared from the consumer thread.
So things are not that simple, and the code (full code here) is an attempt to cover all these intricacies:

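As a rough illustration of the structure only (this is not the actual jaq-benchmarks code: the chain length is hard wired to 2, ConcurrentLinkedQueue stands in for the queue under test, and the iteration teardown/queue clearing and thread role juggling listed above are left out), the shape is something like:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Group;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Control;

@State(Scope.Group)
public class RoundTripSketch {
    private static final int BURST_SIZE = 1;
    private static final Integer TOKEN = 1;
    // queue[0] and queue[1] of a chain of length 2
    private final Queue<Integer> out = new ConcurrentLinkedQueue<Integer>();
    private final Queue<Integer> back = new ConcurrentLinkedQueue<Integer>();

    @Benchmark
    @Group("ring")
    public void source(Control ctl) {
        // send a burst down the first queue...
        for (int i = 0; i < BURST_SIZE; i++) {
            out.offer(TOKEN);
        }
        // ...and wait for the whole burst to come back on the last queue
        for (int i = 0; i < BURST_SIZE; i++) {
            while (!ctl.stopMeasurement && back.poll() == null) {
                // busy wait, bail out when the iteration ends
            }
        }
    }

    @Benchmark
    @Group("ring")
    public void link(Control ctl) {
        // relay elements from the inbound queue to the outbound queue
        Integer e = null;
        while (!ctl.stopMeasurement && (e = out.poll()) == null) {
            // busy wait for something to relay
        }
        if (e != null) {
            back.offer(e);
        }
    }
}
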
I'm relying on correct usage for some things, but JMH is basically doing most of the heavy lifting:
  • JMH is taking care of all the measurements/measurement modes/reporting/formatting including forked runs and stats summaries and all.
  • JMH is taking care of thread management, no need to spin up threads/executors and shut them down.
  • JMH is synchronizing iterations and managing per thread state setup/teardown. Note the @TearDown annotation in Link/Source.
  • JMH allows the group thread balance to be parametrized (a recent addition), which enables generic asymmetric multi-threaded benchmarks like the above.
Where I have to work around the framework:
  • I would like to only specify the chain.length and derive the thread allocation from that. This is possible via the JMH launcher API.
  • I would like @State(Scope.Thread) to mean ThreadLocal.
  • I would like to be able to pass State down the stack as constructor parameters, so I could avoid the static field for sharing benchmark state between thread state objects.
I'd be lying if I said writing this benchmark the JMH way was easier than hacking something together; it takes getting used to, and some of the features I required are fairly late additions. But the benefit is worth it.
I can't stress this point enough, the objective here is more than collecting data, it is about enabling, supporting and empowering peer review. I want the benchmarks to be as simple as they can be so people can either be easily convinced or easily point out my errors. The less code I need my peers to read, the easier it is for anyone to build/run/tweak the benchmarks, the less we rely on handrolled/halfassed attempts at benchmark frameworks THE BETTER (having said that, you got to do what you got to do). 
Note: If you care about latency you should watch Gil Tene's excellent talk on How Not To Measure Latency, where he discusses coordinated omission and its impact on latency measurement. As you may have noticed, this experiment suffers from coordinated omission. I'll have to live with that for now, but I don't feel a corrected measurement would change the conclusions about the queue implementations in any form.

Hard Numbers? Numbers Hard?

We've got the following queues:
  1. ConcurrentLinkedQueue (CLQ) - Good old JDK lock free queue, for reference. 
  2. InlinedCountersSpscConcurrentArrayQueue (CAQ) - My inlined counter variation on Martin Thompson's ConcurrentArrayQueue previously discussed/explained/explored here and compared to this variation here. It can be run with sparse data as described here.
  3. FFBuffer (FF1) - A Java port of the FastFlow SPSC implementation with sparse data as described here.
  4. BQueue (BQ) - An SPSC queue previously discussed here. It introduces an offer-side batching of the null check (which also serves as protection against the false sharing induced by a full queue) and a poll-side batching of the emptiness test, which also includes a spinning backoff to avoid the near empty queue issue.
  5. FFBufferWithOfferBatch (FF2) - A BQueue on the offer side and an FFBuffer on the poll side, this is a new addition to the arsenal.
JMH gives us 2 relevant benchmark modes for the above use case:
  1. AverageTime - Average time per operation (from the command line use '-bm avgt')
  2. SampleTime - Time distribution, percentile estimation (from the command line use '-bm sample')
And we could also look at same core vs. cross core latency, vary the length of the chain, and examine the effect of sparse data on latency. This is a bit of an explosion of data, too much for one post, but let's dip our toes...
I'll start with the average RTT latency for each type, running same core and cross core:

The benchmarks were run on my laptop (JDK7u45/Ubuntu13.10/i7-4700MQ@2.40GHz no turbo boost/benchmark process pinned using taskset) and though I ran them repeatedly I didn't systematically collect run to run variance data, so consider the results subject to some potential run to run variance.
As we can see:
  • FF2/FF1 are very similar and take the cake at 60/230ns (FF2 is marginally faster).
  • BQ is the worst [392/936ns] because of its strategy of backing off as the queue approaches empty.
  • It is perhaps surprising to see that although CAQ/FF1/BQ show very similar throughput in my previous tests their latency characteristics are so distinct.
  • Note that CLQ is actually not too bad latency wise (in this particular use case). 
Let's have a look at what happens when the burst size increases [I dropped FF1 out of the race here, too similar to FF2; this is running across cores]:
We can see that:
  • CLQ makes no effort to amortize costs as its RTT grows with the burst size.
  • BQ starts off as the worst, but somewhere between burst size 16 and 32 the benefits of its design start to show.
  • FF2 and CAQ demonstrate similar growth patterns, with the cost per item dropping as the burst size grows. But FF2 remains the clear winner; at burst size 32 the round trip is 640ns.
  • 640ns for 32 messages averages at just 20ns per message. This is NOT A MEANINGFUL LATENCY NUMBER! :) 

Latency != Time/Throughput

Let me elaborate on this last point. The average cost is not a good representative of the latency. Why not? Latency in this case is the round trip time from each message's perspective; throughput in a concurrent environment is not an indicator of latency because it assumes a uniformity that is just not there in real life. We know the behaviour for the first message is at best the same, so assuming first message latency is 230ns, how would the timeline of the other messages need to look to make the mean latency 20ns? Is that reasonable? What is really happening here is as follows (my understanding in any case):
  • The producer pretty much writes all the messages into the queue in one go, each message written within a few nanoseconds of the previous one. So if we mark T1..T32 as the send times of the messages, we can estimate T32 - T1 is within 100ns - 150ns (single threaded read/write to FF2 takes roughly 6ns; assume an even split between write and read and add some margin). Near empty queue contention can add to this cost, as can any direct contention between poll/offer.
  • In fact, assuming a baseline cost of 6ns read/write, multiplied by chain length, leads to a base cost of 32*6*2=384ns. Following from this is that the minimum latency experienced by any message can never be less than 32*3=96ns as the burst must be sent before messages are received.  96ns is assuming zero cost for the link, this is how fast this would go offering and polling from the same queue in the same thread.
  • With any luck, and depending on the size of the burst, the consumer will see all the messages written by the time it detects data in the queue (i.e. it won't contend with the producer). There is opportunity here for reduced cost which is used by FF2/CAQ and BQ.
  • The consumer now reads from the queue; the first message is detected the same way it would be for burst size 1, i.e. after 115ns or so for FF2. This can only get worse given the contention of other messages potentially being written as the first message is read, adding further latency. The rest of the messages are then read, and there is again an opportunity to lower costs. The messages are written into the outbound queue as soon as they are read. Reading and writing each message has a cost, which we can assume is at the very least as high as 6ns.
  • The producer now detects the messages coming back. Let's mark the return times for the messages R1-R32. We know R1 - T1 is at least 230ns and we know that R32 - T1 is roughly 640ns. Assuming some fixed costs for writing/reading from the queue, we can estimate that R32 - R1 is also within 100-150ns.
  • So, working through the timeline as estimated above gives us the following best case scenario for message 32:
    • 3ns     - Source sends message 1
    • 96ns   -  Source sends message 32
    • 115ns - Link picks up message number 1, has to work through messages 1-31
    • 230ns - Source picks up message 1
    • 301ns - Link is done transferring 31 messages and is finally picking up message 32
    • 307ns - Link has sent message 32
    • 346ns - Source has finished processing message 32
  • In this hypothetical best case scenario, message 32 which had the best chance of improved latency enjoyed a 150ns RTT. This is the theoretical best case ever. IT DIDN'T HAPPEN. The RTT we saw for a 32 messages burst was 640ns, indicating some contention and an imperfect world. A world in which there's every chance message 32 didn't have a latency as good as message 1.
  • The fallacy of the average latency is that it assumes a uniform distribution of the send and receive times, but the reality is that the groups are probably lumped together on a timeline leading to many send times being within a few ns of each other, and many receive times being within a few ns of each other. This is the magic of buffering/queuing.
  • The latency experienced by the last element in a burst can be better than the latency experienced by the first, but not significantly so and subject to some very delicate timing. The important thing here is that good queues allow the producer to make progress as the consumer idles or is busy consuming previous items, thus allowing us to capitalize on concurrent processing.

To Be Continued...

There's a lot of experimentation and data to be had here, so in the interest of making progress with the project, as well as sharing the results, I'll cut this analysis short and continue some other time. To be explored in further posts:
  • Compare sparse data impact on CAQ/FF2 for the bursty use case - I have done some experimentation which suggests FF2 gains little from the sparse data setting, while for CAQ the improvements are noticeable for certain burst sizes.
  • Compare average time results with the sampling mode results - There are significant differences in the results, which I'm still trying to explain. One would expect the sampling mode to have some fixed overhead from having to call System.nanoTime() twice for each sample, but the difference I see is a 2x increase in the mean.
  • Compare queue behaviour for different chain lengths - The results are unstable at the moment because the thread layout is not fixed. For chain lengths 2-4 the topology can be made static, but from chain length 5 and up the number of cross core crossings can change mid run (on my hyper threaded quad core anyway), leading to great variance in results (which is legitimate, but annoying). I will need to pin threads and thread 'roles' to stabilize the benchmark for this use case, probably using JTA (Java Thread Affinity - by Sir Lawrey).
The overall result is pointing to FF2 as the clear winner of the fastest queue competition, but the variety is educational and different designs are useful for different things. This also strengthens my conviction that a requirement based factory for queues serves your code much better than naming the queue flavour you feel like as you code. All the more reason to invest more time into JAQ :-).

Thanks to Darach, Norman, and Peter for reviewing the post, and Aleksey for reviewing the benchmark. Peer review FTW!

Monday 2 December 2013

Announcing the JAQ (Java Alternative Queues) Project


{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
To quote Confucius:
"To learn and then practice it time and again is a pleasure, is it not? To have friends come from afar to share each others learning is a pleasure, is it not? To be unperturbed when not appreciated by others is gentlemanly, is it not?" - Analects 1:1
It is obvious to me the old man was talking about open source software, where we repeat what we learn, share with friends from afar, and try and behave when no one seems to get it. In this spirit I am going to try and apply lessons learnt and put together a concurrent queues library for Java - Java Alternative Queues.
It's early stages, but at this point I would value some feedback on:
  1. Intentions
  2. Interfaces and usability
  3. Project roadmap

Intentions

Where concurrent queues are concerned, it is my opinion that the JDK offering has been robust, but too generic to benefit from the performance gains offered by a more explicit declaration of requirements. JAQ would tackle this by providing queues through a requirements-focused factory interface, allowing the user to specify upfront:
  1. Number of producers/consumers
  2. Growth: Bounded/Unbounded
  3. Ordering (FIFO/other)
  4. Size
  5. Prefer throughput/latency
For a wider taxonomy of queues see the excellent analysis on 1024Cores.net. At this point all the queues I plan to implement are non-blocking and lock-free as my main interest lies in that direction, but many of the principles may hold for blocking implementations and those may get added in the future.

Interfaces and Usability

I like the idea of separating several entities here:
  • ConcurrentQueueFactory - Tell me what you need, through a ConcurrentQueueSpec.
  • ConcurrentQueue - The queue data structure, provided by the factory. At the moment it does nothing but hand out producer/consumer instances. This is where pesky methods such as size/contains may end up. I'm not keen on supporting the full Queue interface so feedback on what people consider essential will be good.
  • ConcurrentQueueConsumer - A consumer interface into the queue, provided by the queue. I'm planning to support several consumption styles.
  • ConcurrentQueueProducer - A producer interface into the queue, provided by the queue.
The producer/consumer instances are thread specific, and the appropriate thread should be the one calling into the queue's provider methods. Here is the old QueuePerfTest converted to use the new interface (I cleared out the irrelevant cruft for timing this and that):

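In sketch form the usage would look something like the code below. To be clear, the class and method names here are inferred from the entity list above rather than copied from the actual JAQ sources, so treat the exact signatures as hypothetical:

public class SpscHandOffSketch {
    private static final int REPETITIONS = 100 * 1000 * 1000;
    private static final Integer TEST_VALUE = 777;

    public static void main(String[] args) throws InterruptedException {
        // ask the factory for a bounded SPSC queue; the spec would also carry the
        // ordering and throughput/latency preferences listed above
        ConcurrentQueueSpec spec = ConcurrentQueueSpec.createBoundedSpsc(32 * 1024);
        final ConcurrentQueue<Integer> queue = ConcurrentQueueFactory.newQueue(spec);

        Thread producerThread = new Thread(new Runnable() {
            @Override
            public void run() {
                // the producer view is acquired on the producing thread
                ConcurrentQueueProducer<Integer> producer = queue.createProducer();
                for (int i = 0; i < REPETITIONS; i++) {
                    while (!producer.offer(TEST_VALUE)) {
                        Thread.yield(); // queue is full, back off
                    }
                }
            }
        });
        producerThread.start();

        // the consumer view is acquired on the consuming thread (main, in this case)
        ConcurrentQueueConsumer<Integer> consumer = queue.createConsumer();
        Integer result = null;
        for (int i = 0; i < REPETITIONS; i++) {
            while ((result = consumer.poll()) == null) {
                Thread.yield(); // queue is empty, back off
            }
        }
        producerThread.join();
        System.out.println("last element: " + result);
    }
}
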
I realize this goes against the current Queue interface, but part of the whole point here is that the more we know about the calling code the better performance/correctness we can hope to offer.

Roadmap

I'd like to tackle the work in roughly the following order:
  • Specify/document/settle on high level interfaces (initial cut done)
  • SPSC implementations/tests/benchmarks (good bounded SPSC implementation is done, some benchmarks)
  • MPSC implementations/tests/benchmarks (some bounded MPSC variants are included but not integrated)
  • SPMC implementations/tests/benchmarks (not started)
  • MPMC implementations/tests/benchmarks (not started)
There's more I want to do in this area, but the above will keep me busy for a while so I'll start with that and increase the scope when it reaches a satisfying state.
I'm using JMH (and getting valuable support from @shipilev) for benchmarking the queues and hoping to use JCStress to test multi-threaded correctness. 

Contributors/Interested Parties

I know I'll be using this library in the near future for a few projects, but I hope it will be generally useful, so your feedback, comments and observations are very welcome. I've not been involved much in open-source projects before, so any advice on project setup is also welcome. Finally, if you feel like wading in and cutting some code, adding some tests or benchmarks, reporting some bugs or expressing interest in features, BRING IT ON :-) pull requests are very welcome.

Monday 11 November 2013

SPSC IV - A look at BQueue

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
I thought I was done with SPSC queues, but it turns out they were not done with me...
A kind reader, Rajiv Kurian, referred me to this interesting paper on an alternative SPSC implementation called BQueue. The paper compares the BQueue to a naive Lamport queue, the FastFlow queue and the MCRingBuffer queue, and claims great performance improvements both in synthetic benchmarks and when utilised in a more real world application. I had a read, ported it to Java and compared it with my growing collection; here are my notes for your shared joy and bewilderment.

Quick Summary of Paper Intro. Up To Implementation

I enjoyed reading the paper, it's not a hard read if you have the time. If not, here are my brief notes (but read the original, really):
  • Some chit chat goes into why cross core queues are important - agreed.
  • The paper claims many SPSC queues perform very differently in a test bed vs. the real world, in particular as real world applications tend to introduce cache traffic beyond the trivial amount experienced in naive benchmarking - I agree with the sentiment, but I think the "Real World" is a challenging thing to bring into your lab in any case. I would argue that a larger set of synthetic experiments would serve us all better in reviewing queue performance. I also admit my benchmarking thus far has not been anywhere near rich enough to cover all I want it to cover. The application presented by the authors is a single sample of the rich ocean of applications out there; who's to say they have found the one shoe which fits all? Having said that, it still strikes me as a good way to benchmark and a valid sample to add to the general "your mileage may vary, measure with your own application" rule.
  • Existing queue implementations require configuration or deadlock prevention mechanisms and are therefore harder to use in a real application - While this argument is partially true for the solutions discussed thus far on this blog, I fear for best results some tuning has to be done in any case. In particular the solution presented by the paper still requires some configuration.
  • Examines Lamport and points out cache line thrashing is caused by cross checking the head/tail variables -  Code for the Lamport variation can be found here.  The variant put forward by Martin Thompson and previously discussed here (with code here) solves this issue by employing producer/consumer local caches of these values. This enhancement and a more involved discussion of cache coherency traffic involved is further discussed here.
  • Examines FastForward queue, which is superior to Lamport but - "cache thrashing still occurs if two buffer slots indexed by head and tail are located in the same cache line". To avoid this issue FF employs Temporal Slipping which is another way of saying a sleep in the consumer if the head/tail are too close. This technique leads to cache thrashing again, and the definition of too close is the configuration mentioned and resented above. - Right, I was not aware of temporal slipping, but I love the term! Next time I'm late anywhere I'll be sure to claim a temporal slippage. More seriously: this strikes me as a streaming biased solution, intentionally introducing latency as the queue becomes near empty/full. From the way it is presented in the article it is also unclear when this method is to be used, as it is obviously not desirable to inflict it on every call.
  • Examines MCRingBuffer, which is a mix of 2 techniques: head/tail caching (like the one employed by Mr. T) and producer batching. The discussion focuses on the issues with producer batching (producer waits for a local buffer to fill before actually writing anything to the queue), in particular the risk of deadlock between 2 threads communicating via such queues - I would have liked the discussion to explore the 2 optimisations separately. The head/tail cache is a very valid optimisation; the producer batching is very problematic due to the deadlock issue described, but also as it introduces arbitrary latency into the pipeline. Some variations on the queue interface support batching by allowing the publisher to 'flush' their write to the queue, which is an interesting feature in this context.
  • Some discussion of deadlock prevention techniques follows with the conclusion that they don't really work.

Looking at the Implementation

So I've been playing with SPSC queues for a while and recently wrote a post covering both the FastFlow queue and the use of sparse data to eliminate the near empty thrashing issue described above. I was curious to see a new algorithm, so I ported the algorithm from the paper to Java. Note the scary Unsafe use is in order to introduce the required memory barriers (they are not NOOPs!). Here is what the new algo brings to the table when written on top of my already existing SPSC queues code (code here):

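Since the code itself lives in the repository, here is only a condensed sketch of the algorithm as I read it, with the field padding, Unsafe array access and memory barriers of the real implementation stripped out for brevity (so treat it as illustrative rather than the actual class):

public class BQueueSketch<E> {
    private static final int TICKS = 200;                    // spin period on a failed poll probe
    private static final int POLL_MAX_BATCH_SIZE = 1 << 11;  // initial/maximum poll probe distance
    private static final int OFFER_BATCH_SIZE = 1 << 11;     // offer probe distance

    private final E[] buffer;
    private final long mask;
    private long tail;       // producer local counter
    private long batchTail;  // producer local: write up to (excluding) this slot without probing
    private long head;       // consumer local counter
    private long batchHead;  // consumer local: read up to (excluding) this slot without probing

    @SuppressWarnings("unchecked")
    public BQueueSketch(int capacity) {
        // capacity must be a power of two and larger than the batch sizes above
        buffer = (E[]) new Object[capacity];
        mask = capacity - 1;
    }

    public boolean offer(E e) {
        if (tail >= batchTail) {
            // probe OFFER_BATCH_SIZE ahead: if that slot is null, the whole run up to it is free
            if (buffer[(int) ((tail + OFFER_BATCH_SIZE) & mask)] != null) {
                return false; // not enough free slots ahead, report full
            }
            batchTail = tail + OFFER_BATCH_SIZE;
        }
        buffer[(int) (tail & mask)] = e; // the real code uses an ordered write here
        tail++;
        return true;
    }

    public E poll() {
        if (head >= batchHead) {
            // probe ahead for a non-null slot, halving the probe distance on each failure
            // (a binary search down to 1) and spinning between failed probes
            int batch = POLL_MAX_BATCH_SIZE; // the real code adapts this from history
            while (true) {
                if (buffer[(int) ((head + batch - 1) & mask)] != null) {
                    batchHead = head + batch;
                    break;
                }
                if (batch == 1) {
                    return null; // the queue looks empty
                }
                spinWait(TICKS);
                batch >>= 1;
            }
        }
        final int offset = (int) (head & mask);
        final E e = buffer[offset]; // the real code uses a volatile read here
        buffer[offset] = null;      // null out so the producer probe sees the slot as free
        head++;
        return e;
    }

    private static void spinWait(int ticks) {
        for (int i = 0; i < ticks; i++) {
            // burn a few cycles before re-probing a near empty queue
        }
    }
}
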
Changes I made to original (beyond porting to Java):
  • Renamed tail/head to be in line with my previous implementations.
  • Renamed the batch constants according to their use.
  • Left the modulo on capacity to the offset calculation and let the counters grow as per my previous implementations (helped me eliminate a bug in the original code).
In a way it's a mix of the FastFlow (thread local use of counters, null check the buffer to verify offer/poll availability) and the MC/Mr. T solutions (use a cached look-ahead index value we know to be available and write/read up to that value with no further checks), which I like (yay cross pollination/mongrels... if you are into any variation on racial purity try having children with your cousins and let me know how it went). The big difference lies in the way near empty/full conditions are handled. Here's a few points on the algorithm:
  • The cached batchHead/batchTail values are used to eliminate some of the null comparisons offer/poll need in FFBuffer. Instead of checking the immediate slot every time, we probe ahead on the queue. This counts on the null/not-null regions of the queue being contiguous:
    • For the producer, if the entry N steps ahead of TAIL (as long as N is less than capacity) is null we can assume all entries up to that entry are also null (lines 4-11). 
    • Similarly we probe ahead as we approach the near empty state, but here we are looking for a not-null entry. Given that the not-null region is contiguous, we probe into the queue ahead of the HEAD; if the probed entry is null then we have overshot the TAIL. If it is not null we can safely read up to that point with no further checks.
  • The BQueue offers us a variation on the above mentioned temporal slip that involves the consumer doing a binary search for a successful probe until it hits the next batch size or declares the queue empty (lines 40 - 49). This is triggered every time the consumer needs to re-calculate the batchHead value having exhausted the known available slots.
  • Every poll probe failure is followed by a spinning wait (line 55-59).
  • Using the buffer elements as the indicator of availability allows the BQueue to avoid reverting to the cache thrashing caused by comparing and sharing the head/tail counters. 
  • The slowdown induced by the algorithm and the spinning wait allows the producer to build up further slack in the queue.  
The temporal slip alternative is the main contribution celebrated in the paper, so it is worth comparing the code snippets offered for all variations.

A Word On Benchmarking

I ran the same benchmark I've been using all along to evaluate SPSC queues. It's synthetic and imperfect. The numbers you get in this benchmark are not the numbers you get in the real world. Your mileage is almost certain to vary. I've been getting some worried responses/tweets/comments about this... so why do I use it?
  • If you have been reading this series of posts you will see that the benchmark has been instrumental in identifying and measuring the differences between implementations, it's been good enough to provide evidence of progress or regression.
  • As a test harness it's been easy to use and examine via other tools. Printing out assembly, analysing hardware counter data, tweaking to a particular format: all easily done with a simple test harness. 
It's imperfect, but practical. I'm increasingly uncomfortable with it as a benchmark because I don't feel the harness scales well to multi-producer examination and I'm in the process of implementing the benchmarks using JMH. As that work is not complete I use what I have.
In the interest of completeness/open for review/honesty/general info, here is how I run benchmarks at the moment:
  1. I'm using my laptop, a Lenovo y510p. It's an i7-4700MQ.
  2. When running benchmarks I pin all running processes to core 0, then run my benchmarks on cores 1-7 (I avoid core 1 if I can). It's not as good as isolcpus, but good enough for a quiet run IMHO.
  3. For this current batch of runs I also disabled the turbo-boost (set the frequency to 2.4GHz) to eliminate related variance. In the past I've left it on (careless, but it was not an issue at the time), summer is coming to South Africa and my machine overheats with it on.
  4. I'm running Ubuntu 13.10 64 bit, and using JDK7u45 also 64 bit.
  5. All the benchmarks were run with "-server -XX:+UseCondCardMark -XX:CompileThreshold=100000". I use these parameters for consistency with previous testing. In particular the UseCondCardMark is important in this context. See previous post for more in depth comparison of flags impact on performance.
  6. For this set of results I only examined the cross core behaviour, so the runs were pinned to cores 3 and 7.
  7. I do 30 runs of each benchmark to get an idea of run to run variance.
I include the above as an open invitation for correction and to enable others to reproduce the results should they wish. If anyone has the time and inclination to run the same benchmarks on their own setup and wants to share the data I'd be very interested. If anyone has a suite of queue benchmarks they want to share, that would be great too.
As for the validity of the results as an indication of real world performance... the only indication of real world performance is performance in the real world. Having said that, some problems can be classified as performance bugs, and I would argue that false sharing and bad choice of instructions fall under that category. All other things being equal I'd expect the same algorithm to perform better when implemented such that its performance is optimal.

Taking BQueue for a Spin  

And how does it perform? It depends... on the configuration (what? but they said?! what?). There are 3 parameters to tune in this algorithm:
  1. TICKS - how long to spin for when poll probing fails.
  2. POLL_MAX_BATCH_SIZE - the limit, and initial value for the poll probe.
  3. OFFER_BATCH_SIZE - how far to probe for offer; also note that OFFER_BATCH_SIZE elements will remain unused of the queue capacity (i.e. the queue can only fit [capacity - OFFER_BATCH_SIZE] elements before offer fails).
I played around with the parameters to find a good spot; here are some variations (see the legend for parameters, X is run number, Y is ops/sec):
  1. There's no point in setting the batch sizes lower than 32 (that's 2 cache lines of compressed OOPs references) as we are trying to avoid contention. Also, it'll just end up being a confusing branch for the CPU to hit on, and the whole point of batching is to get some amortized cost out of it. Even with batch sizes set as low as 32 for both, the queue performs well (the median is around 261M) but with significant variance.
  2. Upper left chart shows the results as the offer batch is extended (poll batch remains at 32). The larger offer batch offers an improvement. Note that extending the offer batch only gives opportunity for the producer to overtake the consumer as the consumer hits the poll probing logic every 32 steps. A large slack in the queue allows the best throughput (assuming consumer is mostly catching up). The variance also decreases as the offer batch is increased.
  3. Upper right chart shows the results for increasing the poll batch (offer batch remains at 32). As we can see this actually results in worse variance.
  4. Increasing both the offer and the poll batch size in step (lower left chart) ends up with better overall results, but still quite significant variance.
  5. Keeping the offer/poll batch at 8k I varied the spin period which again results in a different profile, but no cure to variance.
For context here is the BQ result (I picked the relatively stable 80,32,8k run) compared with the FFBuffer port and my inlined counters variation on Thompson/Lamport, with and without sparse data (Y8 = inlined counters; the square brackets are parameters, for FF/Y8 it is the sparse shift, so 2 means use every 4th reference):
[NOTE: previous benchmarks for same queues were run with turbo boost on, leading to better but less stable results, please keep in mind when considering previous posts]
As you can see the BQueue certainly kicks ass when compared to the non-sparse data variations, but is less impressive when sparse data is used. Note that BQueue manages to achieve this result with far less memory as it still packs the references densely (note that some memory still goes wasted as mentioned above, but not as much as in the sparse data variations). What I read into the above results:
  1. This algorithm tackles the nearly empty/full queue in an appealing manner. Probing the queue to discover availability is also a means of touching the queue ahead and bringing some required data into the cache.
  2. The main reason for the speed up is the slowing down of the consumer when approaching empty. This serves as a neat solution for a queue that is under constant pressure.
  3. The spinning wait between probes presents a potentially unacceptable behaviour. Consider for instance a consumer who is sampling this queue for data, but needs to get on with other stuff should it find it empty. Alternatively consider a low latency application with bursty traffic such that queues are nearly empty most of the time. 
  4. I'll be posting a further post on latency benchmarking the queues, but currently the results I see (across cores, implementing ping pong with an in/out queue) suggest the FF queue offers the best latency (200ns RTT), followed by Y8 (300ns) and the BQueue coming in last (750ns). I expect the results to be worse with bursty traffic preventing the batch history from correctly predicting a poll batch size.

Summary

This is an interesting queue/paper to me, importantly because it highlights 2 very valid issues:
  1. Near empty/full queue contention is a significant concern in the design of queues and solving it can bring large performance gains to throughput.
  2. Real application benefits may well differ from synthetic benchmark benefits. To support better decision making a wider variety of tests and contexts needs to be looked at.
I think the end solution is appropriate for applications which are willing to accept the latency penalty incurred by the consumer when hitting a nearly empty queue.  The near full queue guard implemented for this queue can benefit other queue implementations and has no downside that I can see beyond a moderate amount of wasted space. 
Thanks again to Rajiv Kurian for the pointer to this queue, and Norman M. for his help in reviewing this post.




Monday 21 October 2013

A Lock Free Multi Producer Single Consumer Queue - Round 1

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
Writing a lock free MPSC queue based on the previously presented SPSC queues, and exploring some of the bottlenecks inherent in the requirements and strategies for clearing them.
2 or more...

Having just recovered from a long battle with the last of the SPSC queue series posts, I'm going to keep this short. This time we are looking at a slightly more common pattern: many producing threads put stuff on a queue which funnels into a single consumer thread. Maybe that consumer thread is the only one allowed to perform certain tasks (like accessing a Selector, or writing to a socket or a file). What could we possibly do for this use case?

Benchmark

We'll be comparing the performance of the different alternatives using this benchmark which basically does the following:
  1. Allocate a queue of the given size (always 32k in my tests) and type
  2. Start N producer threads:
    1. A producer thread offers 10M/N items (yield if offer fails)
    2. Start time before first item is enqueued
  3. In main thread consume [10M * N] items from queue, keep last item dequeued
  4. Stop timing: END
  5. Overall time is [END - minimum start from producers], print ops/sec and the last dequeued item
  6. Repeat steps 2 to 5, 14 times, keep ops/sec result in an array
  7. Print out summary line with average ops/sec for last 7 iterations
Very similar to the previous benchmark for SPSC. The producers all wait for the consumer to give the go ahead, so contention is at its worst (insert evil laughter, howling dogs). I ran the tests a fair few times to expose run to run variance on a test machine (Ubuntu13.04/JDK7u40/i7-4700MQ) with 1 to 6 producer threads. To make the graphs a little easier to grok I took a representative run and for now will leave run to run variance to be discussed separately.


Baseline JDK Queues: ArrayBlockingQueue, ConcurrentLinkedQueue and LinkedTransferQueue

These are battle tested long standing favourites and actually support MPMC, so are understandably less specialized. ArrayBlockingQueue is also a blocking implementation, so we can expect it to experience some pain from that. Here's how they do:

We can see that CLQ/LTQ perform best when there's a single producer and quickly degrade to a stable baseline after a few producers join the party. The difference between 1 producer and many is the contention on offering items into the queue. Add to that the fact that the consumer is not suffering from the same amount of contention on its side and we get a pile up of threads. ABQ gains initially from the addition of threads but then degrades back to its initial performance. A full analysis of the behaviour is interesting, but I have no time for that at the moment. Let us compare them to something new instead.

From SPSC to MPSC

There are a few modifications required to turn the SPSC queue implemented in previous posts into an MPSC queue. Significantly, we need to control access to the tail such that publishers don't overwrite each other's elements in the queue and succeed in incrementing the tail in a sequential manner. To do that we will:
  1. Replace the previous lazy setting of the tail to its new value (the privilege of a single writer) with a CAS loop similar to the one employed in AtomicLong.
  2. Once we have successfully claimed our slot we will write to it. Note that this changes the order of events to "claim then write" from "write then claim".
  3. To ensure the orderly write of the element we will use a lazy set into the array.
Here's the code:

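The gist of it, sketched with JDK atomics standing in for the Unsafe access and field padding of the real code (field names and layout here are mine, so consider it illustrative rather than the actual class):

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

public class MpscArrayQueueSketch<E> {
    private final AtomicReferenceArray<E> buffer;
    private final int mask;
    private final int capacity;
    private final AtomicLong tail = new AtomicLong(); // next slot to claim, contended by producers
    private final AtomicLong head = new AtomicLong(); // next slot to read, written by the consumer only

    public MpscArrayQueueSketch(int capacity) { // capacity must be a power of two
        this.capacity = capacity;
        this.mask = capacity - 1;
        this.buffer = new AtomicReferenceArray<E>(capacity);
    }

    public boolean offer(final E e) {
        if (e == null) {
            throw new NullPointerException("null elements are not allowed");
        }
        long currentTail;
        do {
            currentTail = tail.get();
            if (head.get() <= currentTail - capacity) {
                return false; // queue is full
            }
            // claim a slot first via CAS (like AtomicLong.incrementAndGet)...
        } while (!tail.compareAndSet(currentTail, currentTail + 1));
        // ...then write the element with an ordered (lazy) store, so the consumer never
        // observes a partially constructed element, though it may briefly see null
        buffer.lazySet((int) (currentTail & mask), e);
        return true;
    }
}
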
I've dropped the use of a head cache; it's a pain to maintain across multiple threads. Let's start with this and consider reintroducing it later. In theory it should be a help as the queue is mostly empty.
Because of the above reversal of writes we now must contend in the consumer with the possibility of the tail index being visible before the element is:
  1. NOTE: This code does NOT implement Queue::poll according to the contract specified in the interface. In particular the method may return null when the queue is not empty, but the next element is not yet visible. This is expanded, explored and lamented in the following post: Poll me, maybe?. See the JCTools MpscArrayQueue implementation for Queue interface compliant code.
  2. Since the tail is no longer the definitive indicator of presence in the queue, we can side step the issue by using the FastFlow algorithm for poll, which simply tests that the head element is not null without inspecting the tail counter. Note that the counters are still the definitive indicator of emptiness: the queue is non-empty if (tail != head), even if the next element is not yet visible.
  3. Since the producers are interested in the value of head, we must publish it in an orderly fashion using a lazy set; otherwise it is liable to end up assigned to a register...

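The consumer side of the sketch above would then look roughly like this (again a sketch, not the original code):

// the element itself, not the tail counter, tells us whether the slot is ready (FastFlow
// style), and head is published with a lazy set for the producers' full check
public E poll() {
    final long currentHead = head.get();
    final int index = (int) (currentHead & mask);
    final E e = buffer.get(index); // volatile read of the slot
    if (e == null) {
        // either the queue is empty, or a producer has claimed the slot
        // but its element is not visible yet (see the note above)
        return null;
    }
    buffer.lazySet(index, null);   // clear the slot for reuse
    head.lazySet(currentHead + 1); // ordered publication of consumer progress
    return e;
}
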
The above code is worth working through and making sure you understand the reasoning behind every line (ask in the comments if it makes no sense). Hopefully I do too ;-). Was it worth it?

We get some improvement beyond CLQ (x2.5 with 1 producer, x1.7-1.3 as more producers are added), and that is maintained as threads are added, but we are basically stuck on the same issue CLQ has as the producers pile up, namely that we are spinning around the CAS. These queues basically scale like AtomicLong increments from multiple threads; this has been discussed previously here. Benchmarking a counter is an included sample in JMH, so I took the opportunity to check out the latest and greatest in that awesome project and just tweaked it to my needs. Here is what AtomicLong scales like:
It's even worse than the queues at some point because the producer threads at least have other stuff to do apart from beating each other up. How can we contend with this issue?

A CAS spin backoff strategy

If you've used the Disruptor before you will find the idea familiar: instead of naively spinning we will allow some configuration of what to do when we are not getting our way, in the hope that time heals all things. I introduced the strategy in the form of an enum, but it's much of a muchness really:

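One possible shape for it (a sketch rather than the original code; PARK is the strategy referenced below, SPIN and YIELD are my guesses at the other flavours, and the park duration is arbitrary):

import java.util.concurrent.locks.LockSupport;

public enum CasBackoffStrategy {
    SPIN {
        @Override
        public void backoff() {
            // busy spin: do nothing and retry immediately
        }
    },
    YIELD {
        @Override
        public void backoff() {
            Thread.yield(); // give up the CPU slice before retrying
        }
    },
    PARK {
        @Override
        public void backoff() {
            LockSupport.parkNanos(1L); // park briefly, letting other producers make progress
        }
    };

    public abstract void backoff();
}
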
And this is what the offer looks like with the CAS miss strategy in place:

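Sketched as a variation on the MPSC offer above, with the backoff hook applied on a CAS miss (the real code wires the strategy in differently; this just shows where it bites):

public boolean offer(final E e, final CasBackoffStrategy backoff) {
    if (e == null) {
        throw new NullPointerException("null elements are not allowed");
    }
    long currentTail;
    while (true) {
        currentTail = tail.get();
        if (head.get() <= currentTail - capacity) {
            return false; // queue is full
        }
        if (tail.compareAndSet(currentTail, currentTail + 1)) {
            break; // slot claimed
        }
        backoff.backoff(); // lost the CAS race, back off before retrying
    }
    buffer.lazySet((int) (currentTail & mask), e);
    return true;
}
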
There's a trade off to be considered here between a producer's latency and the overall throughput of the system; I have not had time to experiment to that effect. Here's how the different back off strategies do:
As we can see the Park strategy offers the best throughput and seems to maintain it as the number of threads increases. The end result is a fairly consistent 50M ops/sec throughput.

Notes and Reservations

I've not had time yet to explore this queue/problem as much as I would like, and as such this post should be seen as a rough first stab rather than an accomplished summary. In particular:
  1. I'm unhappy with the benchmark/test harness. The producer threads are bound to start and stop out of sync with each other which will lead to uneven levels of contention on the queue. A better way to measure the throughput would require taking multiple samples during a run and discarding the samples where not all producers are offering.
  2. Given time I would have liked to test on larger producer counts, at the very least expanding the graphs to 20 or so producers.
  3. While running the benchmarks I encountered some run to run variation. A larger data set of runs would have been the right way to go.
That said I think the result is interesting and I plan to explore further variations on this theme. In particular:
  1. An MPSC queue with looser FIFO requirements would allow much greater throughput. The challenges there are avoiding producer starvation and offering a degree of timeliness/fairness.
  2. In producing the MPSC queue I have taken another dive into the Disruptor code, a comparison between the Disruptor and the MPSC queue might prove interesting.
Finally, I was intending to port some of these rough experiments into an open source collection of queues for general use (some readers expressed interest). This will happen in the near future, time and space permitting.


Monday 7 October 2013

SPSC revisited part III - FastFlow + Sparse Data


{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
After I published the last post on this topic (if you want to catch up on the whole series see references below) I got an interesting comment from Pressenna Sockalingasamy, who is busy porting the FastFlow library to Java. He shared his current implementation for porting the FastFlow SPSC queue (called FFBuffer) and pointed me at a great paper on its implementation. The results Pressenna was quoting were very impressive, but when I had a close look at the code I found what I believed to be a concurrency issue, namely there were no memory barriers employed in the Java port. Our exchange is in the comments of the previous post. My thoughts on the significance of memory barriers, and why you can't just not put them in, are recorded here.
Apart from omitting the memory barriers the FFBuffer offered two very interesting optimizations:
  1. Simplified offer/poll logic by using the data in the queue as an indicator. This removes the need for tracking the relative head/tail position and using those to detect full/empty conditions.
  2. Sparse use of the queue to reduce contention when the queue is near full or near empty (see musings below). This bloats the queue, but guarantees there are fewer elements per cache line (this one is an improvement put in by Pressenna himself and not part of the original).
So after getting off my "Memory Barriers are important" high horse, I stopped yapping and got to work seeing what I can squeeze from these new found toys.

FFBuffer - Simplified Logic

The simplified logic is really quite compelling. If you take the time to read the paper, it describes the queue with the offer/poll loop I've been using until now as Lamport's circular buffer (the original paper by Lamport is here, not an easy read). The full code is here, but the gist is:

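In sketch form (using AtomicLong counters in place of the padded fields of the linked code), it boils down to both sides reading the other side's counter on every call:

import java.util.concurrent.atomic.AtomicLong;

public class LamportQueueSketch<E> {
    private final E[] buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // consumer index
    private final AtomicLong tail = new AtomicLong(); // producer index

    @SuppressWarnings("unchecked")
    public LamportQueueSketch(int capacity) { // power of two capacity
        buffer = (E[]) new Object[capacity];
        mask = capacity - 1;
    }

    public boolean offer(E e) {
        final long currentTail = tail.get();
        if (currentTail - head.get() == buffer.length) {
            return false; // full: forces a read of the consumer's counter
        }
        buffer[(int) (currentTail & mask)] = e;
        tail.lazySet(currentTail + 1); // publish the new tail
        return true;
    }

    public E poll() {
        final long currentHead = head.get();
        if (currentHead == tail.get()) {
            return null; // empty: forces a read of the producer's counter
        }
        final int index = (int) (currentHead & mask);
        final E e = buffer[index];
        buffer[index] = null;
        head.lazySet(currentHead + 1); // publish the new head
        return e;
    }
}
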
I've been using a modified version which I got from a set of examples put forth by Martin Thompson as part of his lock free algos presentation and evolved further as previously discussed (padding is omitted for brevity):

With the FFBuffer variant Pressenna sent my way we no longer need to do all that accounting around the wrap point (padding is omitted for brevity):

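Condensed to its essentials (padding stripped, and, crucially, with no memory barriers, which is exactly the problem picked up in the 'Good fences' section below), the idea looks like this:

public class FFBufferSketch<E> {
    private final E[] buffer;
    private final int mask;
    private long tail; // producer only
    private long head; // consumer only

    @SuppressWarnings("unchecked")
    public FFBufferSketch(int capacity) { // power of two capacity
        buffer = (E[]) new Object[capacity];
        mask = capacity - 1;
    }

    public boolean offer(E e) {
        final int index = (int) (tail & mask);
        if (buffer[index] != null) {
            return false; // slot still occupied, so the queue is full
        }
        buffer[index] = e;
        tail++;
        return true;
    }

    public E poll() {
        final int index = (int) (head & mask);
        final E e = buffer[index];
        if (e == null) {
            return null; // slot empty, so the queue is empty
        }
        buffer[index] = null; // release the slot back to the producer
        head++;
        return e;
    }
}
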
The new logic is simpler and has fewer moving parts (no head/tail caching), and in theory cache coherency traffic should be reduced by only bringing in the array elements, which are required in any case.

The Near Empty/Full Queue Problem

At some point while testing the different queues I figured the test harness itself may be to blame for the run to run variance I was seeing. The fun and games I had are a story to be told some other time, but one of the things that emerged from these variations on the test harness was the conclusion that testing a near empty queue produces very different results. By reversing the thread relationship (Main is producer, the other thread is consumer, code here) in the test harness I could see great performance gains, the difference being that the producer was getting a head start in the race, creating more slack to work with. This was happening because the consumer and the producer experience contention and false sharing when the queue has 1 to 16 elements in it. The fact that the consumer is reading a cache line that is being changed is bad enough, but it is also changing the same line by setting the old element to null. The nulling out of elements as they are read is required to avoid memory leaks. Also note that when the consumer/producer run out of space to read/write, as indicated by their cached view of the tail/head, they will most likely experience a read miss, with the implications discussed previously here.
Several solutions present themselves:
  • Batch publication: This is a problem for latency sensitive applications and would require the introduction of some timely flush mechanism to the producer. This is not an issue for throughput focused applications. Implementing this solution will involve the introduction of some batch array on the producer side which is 'flushed' into the queue when full. If the batch array is longer than a cache line the sharing issue is resolved. An alternative solution is to maintain a constant distance between producer and consumer by changing the expectation on the consumer side to stop reading at a given distance.
  • Consumer batch nulling: Delay the nulling out of read elements, this would reduce the contention but could also result in a memory leak if some elements are never nulled out. This also means the near empty queue is still subject to a high level of coherency traffic as the line mutates under the feet of the reader.
  • Padding the used elements: We've taken this approach thus far to avoid contention around fields, we can apply the same within the data buffer itself. This will require us to use only one in N cells in the array. We will be trading space for contention, again. This is the approach taken above and it proves to work well.

Good fences make good neighbours

As mentioned earlier, the code above has no memory barriers, which made it very fast but also incorrect in a couple of rather important ways. Having no memory barriers in place means this queue can:
  • Publish half constructed objects, which is generally not what people expect from a queue. In a way it means that notionally stuff can appear on the other side of the queue before you sent it (because the code is free to be reordered by the compiler). 
  • The code can be reordered such that the consumer ends up in an infinite busy loop (the value read from the array can be hoisted out of the reading loop).
I introduced the memory barriers by making the array read volatile and the array write lazy. Note that in the C variation only a write memory barrier is used; I have implemented variations to that effect, but while testing those Pressenna informed me they hang when reading/writing in a busy loop instead of yielding (more on this later). Here's the final version (padding is omitted for brevity):

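In fragment form, and relative to the simplified sketch shown earlier, only the slot access changes: the read becomes a volatile load and the write an ordered (lazy) store via Unsafe. Here UNSAFE is assumed to be acquired reflectively from theUnsafe, ARRAY_BASE from Unsafe.arrayBaseOffset(Object[].class), and ELEMENT_SHIFT combines the element scale with the sparse shift as described below; padding and the constructor are again left out:

public boolean offer(final E e) {
    final long offset = ARRAY_BASE + ((tail & mask) << ELEMENT_SHIFT);
    if (UNSAFE.getObjectVolatile(buffer, offset) != null) {
        return false; // slot still occupied, the queue is full
    }
    // ordered store: publishes the fully constructed element, no store/load fence needed
    UNSAFE.putOrderedObject(buffer, offset, e);
    tail++;
    return true;
}

@SuppressWarnings("unchecked")
public E poll() {
    final long offset = ARRAY_BASE + ((head & mask) << ELEMENT_SHIFT);
    // volatile load: cannot be hoisted out of a busy poll loop
    final Object e = UNSAFE.getObjectVolatile(buffer, offset);
    if (e == null) {
        return null; // the queue is empty
    }
    UNSAFE.putOrderedObject(buffer, offset, null); // hand the slot back to the producer
    head++;
    return (E) e;
}
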
There are a few subtleties explored here:
  • I'm using Unsafe for array access, this is nothing new and is a cut and paste job from the AtomicReferenceArray JDK class. This means I've opted out of the array bound checks we get when we use arrays in Java, but in this case it's fine since the ring buffer wrapping logic already assures correctness there. This is necessary in order to gain access to getVolatile/putOrdered.
  • I switched Pressenna's original field padding method with mine, mostly for consistency, but also because it's a more stable method of padding fields (read more on memory layout here).
  • I doubled the padding to protect against pre-fetch caused false sharing (padding is omitted above, but have a look at the code).
  • I replaced the POW final field with a constant (ELEMENT_SHIFT). This proved surprisingly significant in this case. Final fields are not optimized as aggressively as constants, partly due to the exploited backdoor in Java allowing the modification of final fields (here's Cliff Click's rant on the topic). ELEMENT_SHIFT is the shift for the sparse data and the shift for elements in the array (inferred from Unsafe.arrayIndexScale(Object[].class)) combined.
How well did this work? Here is a chart comparing it to the latest queue implementation I had (Y5 with/Y4 without Unsafe array access), with the sparse data shift set to 0 and 2 (benchmarked on Ubuntu13.04/JDK7u40/i7-4700MQ@2.40GHz using JVM params: -XX:+UseCondCardMark -XX:CompileThreshold=100000 and pinned to run across cores, results from 30 runs are sorted to make the charts more readable):
Y=Ops per second, X=benchmark run
Shazzam! This is a damn fine improvement; note that the algorithm switch is far less significant than the use of sparse data.
To further examine the source of the gain I benchmarked a variant with the element shift parametrized and a further variant with single padding (same machine and settings as above, sparse shift is 2 for all):
Y=Ops per second, X=benchmark run
Single padding seems to make little difference but parametrization is quite significant.

Applying lessons learnt

Along the way, as I was refining my understanding of the FFBuffer optimizations made above I applied the same to the original implementation I had. Here are the incremental steps:

  1. Y4 - Starting point shown above.
  2. Y5 - Use Unsafe to access array (regression).
  3. Y6 - Use sparse elements in the buffer.
  4. Y7 - Double pad the fields/class/array.
  5. Y8 - Put head/tailCache on same line, put tail/headCache on another line.
  6. Y81 - From Y8, Parametrize sparse shift instead of constant (regression).
  7. Y82 - From Y8, Revert to using array instead of Unsafe access (regression).
  8. Y83 - From Y8, Revert double padding of fields/class/array (regression).

Lets start with the journey from Y4 to Y8:
Y=Ops per second, X=benchmark run
Note the end result is slightly better than the FF one, though slightly less stable. Here is a comparison between Y8 and its regressions Y81 to Y83:
Y=Ops per second, X=benchmark run
So... what does that tell us?

  • The initial switch to using Unsafe for array access was a regression as a first step (Y5), but proves a major boost in the final version. This is the Y82 line. The step change in its performance I think is down to the JIT compiler lucking out on array boundary check elimination, but I didn't check the assembly to verify.
  • Parametrizing is an issue here as well, more so than for the FFBuffer implementation for some reason.
  • I expected Y83 (single padding) to be the same or worse than Y6. The shape is similar but it is in fact much better. I'm not sure why...
  • There was a significant hiccup in the field layout of Y4/5/6 resulting in the bump we see in going from Y6 to Y7 and the difference between Y83 and Y6. As of now I'm not entirely certain what is in play here. It's clear that pre-fetch is part of it, but I still feel I'm missing something.
So certainly not all is clear in this race to the top but the end result is a shiny one, let's take a closer look at the differences between Y8 and FFO3. Here's a comparison of these 2 with sparse data set to 0,1,2,3,4 running cross cores:
From these results it seems that apart from peaking for the s=2 value, Y8 is generally slower than FFO3. FFO3 seems to improve up to s=3, then drop again for s=4. To me this suggests FFO3 is less sensitive to the effect of the near empty queue as an algorithm.
Next I compared the relative performance of the 2 queues when running on the same core; this time only s=0, 2 were considered:
This demonstrates that FFO3 is superior when threads are running on the same core and still benefits from sparse data when running in that mode. Y8 is less successful and seems to actually degrade in this scenario when sparse data is used. Y8 offers a further optimization direction I have yet to fully explore in the form of batching consumer reads; I will get back to it in a further post.

On Testing And A Word Of Warning

I've been toying around with several test harnesses which are all on GitHub for you to enjoy. The scripts folder also allows running the tests in different modes (you may want to adjust the taskset values to match your architecture). Please have a play. Mostly the harnesses differ in the way they measure the scope of execution and which thread (producer/consumer) acts as the trigger for starting the test and so on. Importantly you will find there is the BusyQueuePerfTest which removes the Thread.yield() call when an unsuccessful offer/poll happens.
During the long period in which this post has been in the making (I was busy, wrote other posts... I have a day job... it took forever), Pressana has not been idle and got back to me with news that all the implementations which utilize Unsafe to access the array (Y5/SPSCQueue5 and later, most of the FFBuffer implementations) were in fact broken when running the busy spin test and would regularly hang indefinitely. I tested locally and found I could not reproduce the issue at all for the SPSCQueue variations and could only reproduce it for the FFBuffer implementations which used no fences (the original) or only the STORE/STORE barrier (FFO1/2). Between us we have managed to verify that this is not an issue for JDK7u40, which had a great number of concurrency related bug fixes in it, but is an issue for previous versions! I strongly suggest you keep this in mind and test thoroughly if you intend to use this code in production.


Summary

This post has been written over a long period, so please forgive any inconsistencies in style. It covers a long stretch of experimentation and is perhaps not detailed enough... please ask for clarifications in the comments and I will do my best to answer. Ultimately, the source is there for you to examine and play with, and has the steps broken down such that a diff between steps should be helpful.
This is also now part 5 in a series of long running posts, so if you missed any or forgotten it all by now here they are:
  1. Single Producer/Consumer lock free Queue step by step: This covers the initial move from traditional JDK queues to the lock free variation used as a basis for most of the other posts
  2. 135 Million messages a second between processes in pure Java: Taking the queue presented in the first post off-heap and demonstrating its use as an IPC transport via a memory mapped file
  3. Single Producer Single Consumer Queue Revisited: An empiricist tale - part I: Looking at run to run variation in performance in the original implementation and trying to cure it by inlining the counter fields into the queue class
  4. SPSC Revisited - part II: Floating vs Inlined Counters: Here I tried to assess the impact of inlining beyond fixing up the initial implementation's exposure to false sharing on one side.
And here we are :-), mostly done I think.
I would like to thank Pressana for his feedback, contribution and helpful discussion. He has a website with some of his stuff on it; he is awesome! As mentioned above, the idea for using sparse data is his own (quoting from an e-mail):
"All I tried was to cache align access to the elements in the array as it is done in plenty of c/c++ implementations. I myself use arrays of structs to ensure cache alignment in C. The same principle is applied here in Java where we don't have structs but only an array of pointer. "
Finally, the end result (FFO3/Y8) is incredibly sensitive to code changes, as the difference between using a constant and a field demonstrates, but also performs very consistently between runs. As such I feel there is little scope for improvement left there... so maybe now I can get on to some other topics ;-).

Sunday 15 September 2013

Diving Deeper into Cache Coherency

I recently gave a talk about Mechanical Sympathy at Facebook, which was mostly a look at the topic through the SPSC queue optimization series of posts. During the talk I presented the optimisation step (taken in Martin's original series of queue implementations) of adding a cache field for the head and tail. Moving from this:
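Roughly, and with illustrative names rather than the exact code from the talk (assuming AtomicLong counters over a power-of-two array), the "before" version reads the counter owned by the opposite thread on every single call:

import java.util.concurrent.atomic.AtomicLong;

public final class SpscBefore {
    private static final int CAPACITY = 1 << 15;      // power of two
    private static final int MASK = CAPACITY - 1;
    private final Object[] buffer = new Object[CAPACITY];
    private final AtomicLong head = new AtomicLong(); // owned by the consumer
    private final AtomicLong tail = new AtomicLong(); // owned by the producer

    public boolean offer(Object e) {
        final long currentTail = tail.get();
        if (currentTail - head.get() == CAPACITY) {   // read the consumer's counter EVERY time
            return false;                             // queue looks full
        }
        buffer[(int) currentTail & MASK] = e;
        tail.lazySet(currentTail + 1);                // ordered store, visible to the consumer
        return true;
    }

    public Object poll() {
        final long currentHead = head.get();
        if (currentHead >= tail.get()) {              // read the producer's counter EVERY time
            return null;                              // queue looks empty
        }
        final int index = (int) currentHead & MASK;
        final Object e = buffer[index];
        buffer[index] = null;
        head.lazySet(currentHead + 1);
        return e;
    }
}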

To this:
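And the "after" shape in the same sketchy terms: each side keeps a plain, possibly stale, copy of the other side's counter and only re-reads the real counter when the cached value says no further progress can be made:

import java.util.concurrent.atomic.AtomicLong;

public final class SpscAfter {
    private static final int CAPACITY = 1 << 15;
    private static final int MASK = CAPACITY - 1;
    private final Object[] buffer = new Object[CAPACITY];
    private final AtomicLong head = new AtomicLong(); // owned by the consumer
    private final AtomicLong tail = new AtomicLong(); // owned by the producer
    private long headCache;                           // producer's stale copy of head
    private long tailCache;                           // consumer's stale copy of tail

    public boolean offer(Object e) {
        final long currentTail = tail.get();
        final long wrapPoint = currentTail - CAPACITY;
        if (headCache <= wrapPoint) {
            headCache = head.get();          // refresh only when we appear full
            if (headCache <= wrapPoint) {
                return false;
            }
        }
        buffer[(int) currentTail & MASK] = e;
        tail.lazySet(currentTail + 1);
        return true;
    }

    public Object poll() {
        final long currentHead = head.get();
        if (currentHead >= tailCache) {
            tailCache = tail.get();          // refresh only when we appear empty
            if (currentHead >= tailCache) {
                return null;
            }
        }
        final int index = (int) currentHead & MASK;
        final Object e = buffer[index];
        buffer[index] = null;
        head.lazySet(currentHead + 1);
        return e;
    }
}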


This change yields a great improvement in throughput which I explained to be down to the reduction in coherency traffic between the cores. This is:
  1. Not what I stated previously here.
  2. A very brief and unsatisfying explanation.
I shall try and remedy that in this post.




Penance

In my original step by step post I explained that the benefit of the above optimization is the replacement of a volatile read with a plain read. Martin was kind enough to gently correct me in the comments:
"The "volatile read" is not so much the issue. The real issues come from the read of the head or tail from the opposite end will pretty much always result in a cache miss. This is because the other side is constantly changing it."
and I argued back, stating the JIT compiler can hoist the field into a register and wistfully wishing I could come up with an experiment to prove it one way or the other. I was mostly wrong and Martin was totally right.
I say mostly and not entirely because eliminating volatile reads is not a bad optimization and is responsible for part of the perf improvement. But it is not the lion's share. To prove this point I have hacked the original version to introduce volatile reads (but still plain writes) from the headCache/tailCache fields. I won't bore you with the full code; it uses the field offset and Unsafe.getLongVolatile to do it, roughly as sketched below.
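For the curious, the hack looks something like this (field and class names are illustrative, not the actual hacked queue):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Read an otherwise plain long field with volatile semantics by going through
// its field offset and Unsafe.getLongVolatile, while keeping the plain store.
public class VolatileReadHackSketch {
    private static final Unsafe UNSAFE;
    private static final long HEAD_CACHE_OFFSET;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            HEAD_CACHE_OFFSET = UNSAFE.objectFieldOffset(
                    VolatileReadHackSketch.class.getDeclaredField("headCache"));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private long headCache;              // written with plain stores, as before

    long lvHeadCache() {                 // load with volatile semantics
        return UNSAFE.getLongVolatile(this, HEAD_CACHE_OFFSET);
    }

    void spHeadCache(long value) {       // plain store, unchanged
        this.headCache = value;
    }
}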
I ran the test cross core 30 times and averaged the summary results. I was running on new hardware (i7-4700MQ/Ubuntu 13.04/JDK7u25), so there is no reference point to previously quoted results. Here are the results, comparing Original21 (head/tail padded, no cache fields), Original3 (head/tail padded, head/tailCache padded) and VolatileRead (same as Original3, but with volatile reads of the cache fields):

Original21:     65M ops/sec
Original3:     215M ops/sec
VolatileRead:  198M ops/sec

As we can see, there is no massive difference between Original3 and VolatileRead (especially when considering the difference from Original21), leading me to accept I need to buy Martin a beer, and apologize to any reader of this blog who got the wrong idea reading my post.
Exercise for the reader: I was going to run the above with perf to quantify the impact on cache misses, but as perf doesn't support this functionality on Haswell I had to let it go. If someone cares to go through the exercise and post a comment it would be most welcome.




What mean cache coherency?

Moving right along, now that we accept the volatile read reduction is not the significant optimization here, let us dig deeper into this cache coherency traffic business. I'll avoid repeating what is well covered elsewhere (like cache-coherence, or MESIF on wikipedia, or this paper, some excellent examples here, and an animation for windows users only here) and summarize:
MESI State Diagram
  • Modern CPUs employ a set of caches to improve memory access latency.
  • To present us with a consistent view of the world when memory is accessed concurrently, the caches must communicate to maintain a coherent state.
  • There are variations on the protocol used. MESI is the basic one; recent Intel CPUs use MESIF. In the examples below I use MESI, as the added state in MESIF adds nothing to this use case.
  • The guarantee they go for is basic:
    • One change at a time (to a cache line): A line is never in M state in more than one place. 
    • Let me know if I need a new copy: If I have a copy and someone else changed it I need to get me a new copy. (My copy will move from Shared to Invalid)
All of this happens under the hood of your CPU without you needing to lose any sleep, but if we somehow cause massive amounts of cache coherency traffic then our performance will suffer. One note about the state diagram. The observant reader (thanks Darach) will notice the I to E transition for Processor Write(PW) is in red. This is not my diagram, but I think the reason behind it is that the transition is very far from trivial (or a right pain in the arse really) as demonstrated below.




False Sharing (once more)

False sharing is a manifestation of this issue where 2 threads compete to write to the same cache line, which is constantly being Invalidated in each of their caches.

Here's a blow by blow account of false sharing in terms of cache coherency traffic:
  1. Thread1 and Thread2 both have the false Shared cache line in their cache
  2. Thread1 modifies HEAD (state goes from S to E), Thread2's copy is now Invalid
  3. Thread2 wants to modify TAIL but its cache line is now in state I, so it experiences a write miss:
    1. Issue a Read With Intent To Modify (RWITM)
    2. RWITM intercepted by Thread1
    3. Thread1 Invalidates own copy
    4. Thread1 writes line to memory
    5. Thread2 re-issues RWITM which results in read from memory (or next layer cache)
  4. Thread1 wants to write to HEAD :( same song and dance again, the above gets in its way etc.

The diagram to the right is mine; note that the numbers on the left track the explanation, and the idea was to show the cache lines as they mutate on a timeline. Hope it helps :-).
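As an aside, the usual Java-level defence is to pad the hot fields apart so HEAD and TAIL can never land on the same line. A rough sketch of the class-hierarchy padding trick (illustrative, not the queue's actual layout):

// The hierarchy stops the JVM from packing the hot fields together: superclass
// fields are laid out before subclass fields, so head and tail end up separated
// by at least a cache line of padding on either side.
class PadHead   { long p00, p01, p02, p03, p04, p05, p06, p07; }
class HeadValue extends PadHead   { volatile long head; }
class PadMiddle extends HeadValue { long p10, p11, p12, p13, p14, p15, p16, p17; }
class TailValue extends PadMiddle { volatile long tail; }
public class PaddedCountersSketch extends TailValue { long p20, p21, p22, p23, p24, p25, p26, p27; }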




Reducing Read Misses

So False Sharing is really bad, but this is not about False Sharing (for a change). The principle is similar though. With False Sharing we get Write/Read misses; the above manoeuvre is about eliminating Read misses. This is what happens without the HEADC/TAILC fields (C for cache):
  1. Thread1 and Thread2 both have the HEAD/TAIL  cache lines in Shared state in their cache
  2. Thread1 modifies HEAD (state goes from S to E), Thread2's copy is now Invalid
  3. Thread2 wants to read HEAD to check if the queue wrapped, experiences a Read Miss:
    1. Thread2 issues a request for the HEAD cache line
    2. Read is picked up by Thread1
    3. Thread1 delivers the HEAD cache line to Thread2, the request to memory is dropped
    4. The cache line is written out to main memory and now Thread1/2 have the line in Shared state
This is not as bad as before, but is still far from ideal. The thing is, we don't need the latest value of HEAD/TAIL to make progress; it's enough to know that the next increment to our counter is available to be used. So in this use case, caching the head/tail gives us a great win by avoiding all of these Read Misses; we only hit a Read Miss when we run out of progress to be made against our cached values. This looks very different:
  1. Thread1 and Thread2 both have the HEAD/TAIL/HEADC/TAILC  cache lines in Shared state in their cache (the TAILC is only in Thread1, HEADC is only in Thread2, but for symmetry they are included).
  2. Thread1 reads TAILC and modifies HEAD (state goes from S to E), Thread2's copy is now Invalid.
  3. Thread2 reads from HEADC which is not changed. If TAIL is less than HEADC we can make progress (offer elements) with no further delay. This continues until TAIL is equal to HEADC. As long as this is the case HEAD is not read, leaving it in state M in Thread1's cache. Similarly TAIL is kept in state M and Thread2 can make progress. This is when we get the great performance boost as the threads stay out of each others way.
  4. Thread1 now needs to read TAIL to check if the queue wrapped, experiences a Read Miss:
    1. Thread1 issues a request for the TAIL cache line
    2. Request is picked up by Thread2
    3. Thread2 delivers the TAIL cache line to Thread1, the request to memory is dropped
    4. The cache line is written out to main memory and now Thread1/2 have the line in Shared state
    5. TAIL is written to TAILC (TAILC line is now E)
The important thing is that we spend most of our time not needing the latest HEAD value which means Thread2 is not experiencing as many Read Misses, with the added benefit of the threads not having to go back and forth from Shared to Exclusive/Modified state on writes. Once I have a line in the M state I can just keep modifying with no need for any coherence traffic. Also note we never experience a Write Miss.




Summary

In an ideal application (from a cache coherency performance POV) threads never/rarely hit a read or write miss. This translates to having single writers to any piece of data, and minimizing reads to any external data which may be rapidly changing. When we are able to achieve this state we are truly bound by our CPU (rather than memory access).
The takeaway here is: fast moving data should be in M state as much as possible. A read/write miss by any other thread competing for that line will lead to reverting to S/I, which can have significant performance implications. The above example demonstrates how this can be achieved by caching stale but usable copies locally to the other thread.