Tuesday 28 January 2014

Picking the 2013 SPSC queue champion: benchmarking woes

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
I put FFBufferWithOfferBatch together as part of looking into the performance of BQueue, and with all the excitement about JAQ I never introduced it properly. As it's the best performing SPSC (Single Producer Single Consumer) queue I have thus far, I thought I should quickly cover it and tell you some horrific benchmarking tales of how I determined that it is indeed the best so far.

Offer like a BQueue

The BQueue was previously presented and discussed here. The offer logic is simple and improves on the original FFBuffer by ensuring clearance to write ahead for the next OFFER_BATCH_SIZE elements (a sketch of the idea follows the list below). This improvement has 3 effects:
  1. Most of the time we can write without reading and checking for null. We do one volatile read for every OFFER_BATCH_SIZE elements; the rest of the time we compare against the cached value, which is a normal field that can be held in a register.
  2. We avoid hitting the full queue induced contention by keeping a certain distance from the head of the queue. Contention can still happen, but it should be quite rare.
  3. We potentially miss out on part of the queue capacity, as we will declare the queue full if there are fewer than OFFER_BATCH_SIZE slots available when we hit the cached value.
Overall this is not a bad deal.
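To make the idea concrete, here's a minimal sketch of a batched offer (and a plain poll) over an AtomicReferenceArray. This is illustrative only: the names, the batch size and the use of AtomicReferenceArray instead of the real queue's memory layout are my assumptions, not the actual FFBufferWithOfferBatch code.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Illustrative sketch only, not the actual FFBufferWithOfferBatch code.
// capacity must be a power of 2 and at least OFFER_BATCH_SIZE.
final class BatchedOfferSpscSketch<E> {
    private static final int OFFER_BATCH_SIZE = 32; // illustrative value

    private final AtomicReferenceArray<E> buffer;
    private final int mask;
    private long producerIndex; // written by the producer thread only
    private long batchLimit;    // cached clearance: a plain field, register friendly
    private long consumerIndex; // written by the consumer thread only

    BatchedOfferSpscSketch(int capacity) {
        buffer = new AtomicReferenceArray<E>(capacity);
        mask = capacity - 1;
    }

    public boolean offer(E e) {
        if (e == null) {
            throw new NullPointerException();
        }
        if (producerIndex >= batchLimit) {
            // One volatile read per OFFER_BATCH_SIZE elements: probe the slot
            // OFFER_BATCH_SIZE ahead. If it is still occupied we declare the queue
            // full, even though up to OFFER_BATCH_SIZE - 1 slots may still be free.
            if (buffer.get((int) ((producerIndex + OFFER_BATCH_SIZE) & mask)) != null) {
                return false;
            }
            batchLimit = producerIndex + OFFER_BATCH_SIZE;
        }
        buffer.lazySet((int) (producerIndex & mask), e); // ordered store of the element
        producerIndex++;
        return true;
    }

    public E poll() {
        final int index = (int) (consumerIndex & mask);
        final E e = buffer.get(index); // volatile read of the next slot
        if (e == null) {
            return null;               // queue is empty (or looks empty)
        }
        buffer.lazySet(index, null);   // clear the slot so the producer can reuse it
        consumerIndex++;
        return e;
    }
}
```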

Poll like an FFBuffer

This poll method is again quite simple. The offset calculation hides the sparse data shift trick that is discussed in depth here (a sketch of the index calculation follows below). The sparse data trick is a way of reducing contention when nearing the empty queue state: with a sparse shift of 2, only 4 elements in the queue buffer actually share the same cache line (as opposed to 16 when the data is not sparse), reducing the frequency with which that state occurs. The side effect is reduced memory throughput, as each cache line we load carries less data.
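For illustration, the sparse shift amounts to something like the index calculation below. The names are made up and, unlike the real queue, this sketch works with plain array indices rather than raw memory offsets.

```java
// Illustrative sketch of the sparse data shift, not the actual JAQ code.
// With SPARSE_SHIFT = 2 each logical slot is 4 references apart, so a 64 byte
// cache line holds 4 live elements instead of 16 (assuming 4 byte compressed
// references), at the cost of a backing array 4x the queue capacity.
final class SparseIndexSketch {
    static final int SPARSE_SHIFT = Integer.getInteger("sparse.shift", 2);

    // mask is (capacity - 1) for a power of 2 capacity; the physical array length
    // must be capacity << SPARSE_SHIFT for this mapping to stay in bounds.
    static int arrayIndex(long logicalIndex, int mask) {
        return (int) ((logicalIndex & mask) << SPARSE_SHIFT);
    }
}
```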

Benchmarks: A Misery Wrapped In An Enema

I've been benchmarking these queues over a long period of time and it has been a learning experience (i.e. very painful). Benchmarking is just generally a complex matter, but when it comes to examining and comparing code where a single operation (offer/poll) takes nano-seconds, I quote Mr. Shipilev: "nano-benchmarks are the damned beasts". Now, consider a multi-threaded nano-benchmark...
Let's start with naming the contestants:
  CAQ - ConcurrentArrayQueue
  FF1 - FFBuffer
  FF2 - FFBufferWithOfferBatch
Now, the first benchmark I used to quantify the performance of these queues was the one I got from Martin's examples project. I stuck with it for consistency (some cosmetic changes perhaps), though I had some reservations about the section under measurement. Here's how the measured section goes:

The benchmark measures the time it takes for REPETITIONS number of 'messages' to be sent from the producer to the consumer. If the consumer finds that the queue is empty it will Thread.yield and try again; similarly, if the producer finds that the queue is full it will Thread.yield and try again.
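Roughly, in code, that measured section looks like the sketch below. This is a reconstruction from the description above rather than the actual benchmark source; REPETITIONS, the test value and all names are illustrative.

```java
import java.util.Queue;

// Sketch of the measured section described above, not the actual benchmark code.
// The main thread consumes and takes the timestamps, the producer runs on its own
// thread, and both sides Thread.yield() when they find the queue empty/full.
public final class ThroughputSketch {
    static final int REPETITIONS = 50 * 1000 * 1000; // illustrative value
    static final Integer TEST_VALUE = 777;

    public static long opsPerSecond(final Queue<Integer> queue) throws InterruptedException {
        final Thread producer = new Thread(new Runnable() {
            @Override
            public void run() {
                int i = REPETITIONS;
                do {
                    while (!queue.offer(TEST_VALUE)) {
                        Thread.yield(); // queue is full: back off and retry
                    }
                } while (0 != --i);
            }
        });

        final long start = System.nanoTime();
        producer.start();
        Integer result;
        int i = REPETITIONS;
        do {
            while (null == (result = queue.poll())) {
                Thread.yield(); // queue is empty: back off and retry
            }
        } while (0 != --i);
        producer.join(); // note: the join is inside the measured section here
        final long duration = System.nanoTime() - start;
        return (REPETITIONS * 1_000_000_000L) / duration;
    }
}
```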
Here's how the queues perform with that benchmark (running on Ubuntu13.10/JDK7u45/i7-4300MQ@2.4 no turbo boost, pinned across cores using taskset, with JVM options: '-server -XX:+UseCondCardMark -XX:CompileThreshold=100000'):
  CAQ(sparse=0) - 130M ops/sec
  FF1(sparse=0) - 155M ops/sec
  FF2(sparse=0) - 185M ops/sec
  CAQ(sparse=2) - 288M ops/sec
  FF1(sparse=2) - 282M ops/sec
  FF2(sparse=2) - 290M ops/sec

When I started on JAQ I took the same benchmark, but changed a few things. I moved the start timestamp to the producing thread and moved the thread join out of the measured section; I also added some counters for when the producer fails to offer (queue is full) and the consumer fails to poll (queue is empty). Finally, I took this same benchmark and ported it to use my very own ConcurrentQueue interface. Here's how it looked:

You would think this sort of change could only make performance slightly worse, if it made any difference at all. Certainly that's what I thought, but I was wrong... With the new benchmark I got the following results:
  FF2(sparse=0) - 345M ops/sec
  FF2(sparse=2) - 327M ops/sec


WTF? What is the difference?

I looked long and hard at the benchmark and realized (by process of elimination) that the only difference, the one that shouldn't really make a difference but turns out to be the one that does, is the localization of the queue field into a local variable in the producer loop (a sketch of the tweak is at the end of this section). Tweaking the original benchmark to localize the queue reference gives us different results for all queues:
  CAQ(sparse=0) - 291M ops/sec
  FF1(sparse=0) - 190M ops/sec
  FF2(sparse=0) - 348M ops/sec
  CAQ(sparse=2) - 303M ops/sec
  FF1(sparse=2) - 287M ops/sec
  FF2(sparse=2) - 330M ops/sec

So we notice things have not improved much for FF1, but for CAQ we now have only a marginal difference between using sparse data and not using it, and for FF2 it is actually better not to bother at all with sparse data. The localization of the queue reference made the producer faster, reducing the number of times the empty queue state is hit and thus reducing the need for sparse data. We can try to validate this claim by running the Queue variation of this benchmark with the counters. With the reference localized:
  CAQ(sparse=0) - 291M ops/sec, poll fails 350,    offer fails 0
  FF1(sparse=0) - 192M ops/sec, poll fails 32000,  offer fails 0
  FF2(sparse=0) - 348M ops/sec, poll fails 150,    offer fails 13000
  CAQ(sparse=2) - 303M ops/sec, poll fails 270,    offer fails 0
  FF1(sparse=2) - 287M ops/sec, poll fails 200,    offer fails 0
  FF2(sparse=2) - 330M ops/sec, poll fails 170,    offer fails 10

So we can see that adding the extra counters made little difference with the reference localized, but when referencing the field directly in the producer loop:
  CAQ(sparse=0) - 167M ops/sec, poll fails 2400,   offer fails 0
  FF1(sparse=0) - 160M ops/sec, poll fails 100000, offer fails 0
  FF2(sparse=0) - 220M ops/sec, poll fails 2000,   offer fails 0
  CAQ(sparse=2) - 164M ops/sec, poll fails 31000,  offer fails 0
  FF1(sparse=2) - 250M ops/sec, poll fails 2000,   offer fails 0
  FF2(sparse=2) - 255M ops/sec, poll fails 500,    offer fails 0

We get a different picture again, in particular for CAQ. To get to the bottom of the exact effect localizing the reference had on this benchmark I will have to dig into the generated assembly. This will have to wait for another post...
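For concreteness, the localization tweak boils down to reading the queue field into a local variable once, before the hot loop, so the loop works against a stack slot/register instead of re-reading the field on every iteration. Below is an illustrative reconstruction of the producer with the reference localized, not the exact benchmark code; names are made up.

```java
import java.util.Queue;

// Illustrative reconstruction of the producer loop with the queue reference
// localized; not the exact benchmark code.
final class ProducerSketch implements Runnable {
    static final int REPETITIONS = 50 * 1000 * 1000; // illustrative value
    static final Integer TEST_VALUE = 777;

    private final Queue<Integer> queue; // the original loop referenced this field directly

    ProducerSketch(Queue<Integer> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        final Queue<Integer> q = this.queue; // the tweak: read the field once
        int i = REPETITIONS;
        do {
            while (!q.offer(TEST_VALUE)) {
                Thread.yield();
            }
        } while (0 != --i);
    }
}
```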


Conclusions And Confusions

The overall winner, with all the different variations on the throughput benchmarks, is the FFBufferWithOfferBatch. It's also the winner of the previously presented/discussed latency benchmarks (part 1, part 2). With turbo boost on it hits a remarkable high of 470M ops/sec. But setting this result to one side, the above results highlight a flaw in the throughput benchmark that is worth considering in the context of other concurrent benchmarks, namely that changing the [queue] implementation under test can change the benchmark.
Let me elaborate on this last point. What I read into the results above is that the benchmark was initially one where the queue full state was never hit, and the queue empty state was hit more or less depending on the queue implementation. Given that FF2 is only a change to the offer method of FF1, we can see how tilting the balance between the offer cost and the poll cost changed the nature of the test. In particular, when using no sparse data the producer turned out to be significantly faster than the consumer. But... in changing the implementation we have unbalanced the benchmark: it is hardly ever hitting an empty queue, which is what would slow down the consumer/producer threads. We have switched to measuring the full speed of consumption as the bottleneck, but only for one queue implementation. So this is, in a way, not the same benchmark for all queues.
While this benchmark is still useful to my mind, it is worth considering that it is in fact benchmarking different use cases for different queues, and that the behaviour of the code using the queue will inevitably balance the producer/consumer costs quite differently. All in all, yet another case of your mileage may vary ;-)

Thanks to Peter, Darach and Mr. Gomez

Thursday 16 January 2014

JAQ: Using JMH to Benchmark SPSC Queues Latency - Part II

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
{If you come here looking for JMH related content start at the new and improved JMH Resources Page and branch out from there!}
Just quickly sharing more results/insights from running different configurations of the SPSC latency benchmark discussed in the previous post. The previous post reviewed the different implementations' behaviour when sending bursts of different sizes; in this post I'll have a look at the impact of the length of the chain (the number of threads passing the messages from one to the other and back to the originator) on latency. The benchmarks and queues stay the same, so you may have to skip back to the previous post for context.

TL;DR

This post may not be of great popular appeal... it's a verification of the behaviour of the current SPSC implementations in terms of single-threaded operation costs and RTT latency for different 'trip' lengths. The main finding of interest (for me) is that the sparse data method previously discussed here is proving to be a hindrance to latency performance. This finding is discussed briefly at the end and is something I may dig into more deeply in the near future. Other more minor nuggets were found... but anyhow, let's get to it.

Chain length 1: The cost of talking to yourself

When I had Darach review the original post he pointed out to me that I neglected to cover this edge case in which a queue is being used by a single thread acting as both the consumer and the producer. The result is interesting as a baseline of operation costs when no concurrency is involved. The benchmark is very simple:
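The sketch below gives the rough shape of it: offer a burst, then poll it back, all on the JMH measurement thread. This is an illustration of the idea rather than the actual benchmark class; the class name, burst sizes and queue choice are all placeholders.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Illustrative single threaded burst cost benchmark, not the actual JAQ benchmark.
// Note this reports nanos per burst; divide by burstSize for a per-op figure.
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class SingleThreadedBurstCost {
    @Param({"1", "4", "16", "64", "256", "1024"})
    int burstSize;

    Queue<Integer> q;
    Integer element = 1;

    @Setup
    public void createQueue() {
        // Swap in CLQ/CAQ/FF2 here to compare implementations.
        q = new ArrayDeque<Integer>(2048);
    }

    @Benchmark
    public void offerAndPollBurst() {
        for (int i = 0; i < burstSize; i++) {
            q.offer(element);
        }
        for (int i = 0; i < burstSize; i++) {
            q.poll();
        }
    }
}
```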


The results give us a sense of the cost of [(offer * burst size) + (poll * burst size)] for the different implementations. I added one of my favourite collections, the ArrayDeque, to compare the efficiency of a non-thread-safe queue implementation with my furry gremlins. Here's the score for AD vs CLQ for the different burst sizes:

This is a bit of a bland presentation, so feel free to call me names (you bland representor of numbers!). The numbers are here if you want to play. All the numbers are in nanos per op, with the mean error in brackets. Benchmark run on my laptop (Ubuntu 13.10/JDK7u45/i7-4700MQ@2.4 capped). Here's the score for CAQ/FF2 with the sparse shift set to 0 and 2:


Points for consideration:
  • CLQ (ConcurrentLinkedQueue) is the worst. It is completely the wrong tool for the job, and the result is a lesson in what using concurrent data structures where you don't need them will cost you.
  • Overall AD (ArrayDeque) is the best from burst size 4 and up. This is the right tool for the job when your consumer and producer are the same thread. Choose the right tool, get the right result.
  • CAQ (ConcurrentArrayQueue) and FF2 (FFBufferWithOfferBatch) start off similar to or better than AD, but soon fall behind. Interestingly, FF2 starts as the winner of the three and ends as the loser, with CAQ overtaking it from burst size 64 onward.

I found it interesting to see how the behaviour changed as the burst size grew, and I thought the single-threaded cost estimates were valuable as enablers of other estimates, but overall this is not a benchmark to lose sleep over.
Note that CAQ/FF2 are measured in 2 configurations, with and without sparse data. The use of sparse data is counterproductive in this use case, as it aims to reduce contention that is not there and inhibits throughput in the process.

Stable Chains: Working across the cores

Due to the limits of my benchmarking machine I can only demonstrate stable behaviour for chains of length 1 to 4 without resorting to pinning threads to cores. Still, the results offer some insight. The benchmark in use is the one discussed previously here. The whole process is pinned such that the only available inter-thread channel is across cores. On my machine I have a hyper-threaded quad-core CPU so 8 logical cores from a taskset perspective. The even/odd numbers are on different cores. For my runs I pinned all other processes to core 0 and pinned the JMH process to different cores (as many as I needed from 1,3,5,7). This is not ideal, but is workable.
I didn't bother with CLQ for this set of results. Just CAQ and FF2. The benchmark code is here.
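To give a sense of what each hop does, here's an illustrative (non JMH) sketch of a single link in the chain: spin on the inbound queue and relay every message to the outbound queue, while the originating thread times how long a full burst takes to come back around. Names are made up and this is not the linked benchmark code.

```java
import java.util.Queue;

// Illustrative sketch of one link in the chain, not the actual (JMH based)
// benchmark code linked above.
final class ChainLinkSketch implements Runnable {
    private final Queue<Integer> in;
    private final Queue<Integer> out;
    private volatile boolean running = true;

    ChainLinkSketch(Queue<Integer> in, Queue<Integer> out) {
        this.in = in;
        this.out = out;
    }

    @Override
    public void run() {
        while (running) {
            final Integer e = in.poll(); // spin on the inbound queue
            if (e != null) {
                while (!out.offer(e)) {
                    // busy spin until the next link has room
                }
            }
        }
    }

    void stop() {
        running = false;
    }
}
```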
Here's chain length 2 (T1 -> T2 -> T1):


Here's chain length 3 (T1 -> T2 -> T3 -> T1):

Here's chain length 4 (T1 -> T2 -> T3 -> T4 -> T1):


This is data collected from (10 warmup iterations + 10 iterations) * 10 forks, so benchmarking took a while. This should mean the data is slightly more than anecdotal...
Conclusions:

  • FF2 consistently proves to be the best queue in terms of inter-thread latency. This advantage is maintained for all burst sizes and chain lengths.
  • Sparse data only improves results consistently for burst size 1, and then only marginally. Beyond that it seems to have a small positive effect on CAQ with chain lengths 3/4 and small bursts. This may be a quirk of the use case demonstrated by the benchmark, but it at least shows the method has a demonstrable downside.
  • The burst RTT does not grow linearly with the burst size, in particular for the smaller bursts. This is due to the bursts filling up the time-to-notify period with write activity.
Here's the latency for FF2(sparse=0) across all chains:

  • Single-threaded cost grows linearly with burst size, which is how we are used to thinking about cost. It's also a lower bound for the round-trip use case, as the round trip requires at least the amount of work done in the single-threaded case to still be done in the measuring/source thread.
  • Once we increase the chain length we get an initial plateau whose slope steadily increases until growth becomes linear again. Note that for the 2-threaded case the costs are nearly the same as the single-threaded case from burst size 512 onward.

Where to next?

I've had a long break over Xmas; some progress was made on JAQ, but not as much as I'd have liked... I actually spent most of the time away from a keyboard. There are now MPSC/SPMC implementations, a direct ConcurrentQueue implementation for SPSC hooked into the factory, and an MPSC one nearing completion. I've had some interest from different people and am working towards meeting their requirements/needs... we'll get there eventually :-)

Thanks Norman M. for his review (he did tell me to add graphs... my bad)