Psychosomatic, Lobotomy, Saw: Linked Array Queues, part 2: SPSC Benchmarks

JCTools has a bunch of benchmarks we use to stress test the queues and evaluate optimizations.

These are of course not 'real' workloads, but serve to highlight imperfections and opportunities. While it is true that an optimization might work in a benchmark but not in the real world, a benchmark can work as a demonstration that there are at least circumstances in which it does work. All measurement is imperfect, but not as imperfect as claims made with no fucking evidence whatsoever, so here goes.

How do these linked-array queues fare in the benchmarks? What can we learn here?

The linked array queues are a hybrid of the array and linked queues. So it seems reasonable that we should compare them to both SpscArrayQueue and SpscLinkedQueue. We should also consider how the queues differ and see if we can flush out the differences via the benchmarks.
If you crack under the pressure of boring details, skip to the summary, do not stop at interlude, do not collect a cool drink or get praise, just be on yer fuckin' merry way.

Setup:

Benchmarks are run on a quiet server class machine:

Xeon processor (Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz): 2 CPUs x 12 cores x 2 threads (HT)
CentOS
Oracle JDK8u101
All benchmarks are run taskset to cores on the same NUMA node, but such that threads cannot share the same physical core.
Turbo boost is off, the scaling governor is userspace and the frequency is fixed.
The code is on GitHub

Throughput benchmark: background and method

A throughput benchmark for queues is a tricky fucker. In particular the results change meaning depending on the balance between consumer and producer:

If the consumer is faster than the producer we are measuring empty queue contention (producer/consumer hitting the same cache line for elements in the queue, perhaps sampling each other's index). Empty queues are the expected state for responsive applications.
If the producer is faster than the consumer we are measuring full queue contention, which may have similar issues. For some queues which optimize for the healthy assumption that queues are mostly empty this may be a particularly bad place to be.
If the producer and consumer are well balanced we are testing a streaming use case which offers the most opportunities for progress for both consumer and producer. This should yield the best performance, but for most applications may not be a realistic scenario at all.

The JCTools throughput benchmark does not resolve these issues. It does however report results which give us an idea of poll/offer failure rates which are in turn indicative of which state we find ourselves in.
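To make the "failure rates tell you which state you are in" idea concrete, here is a minimal sketch. The class, the enum and the 1% cut-off are all my own invention for illustration, they are not part of JCTools or of the benchmark; pick whatever threshold suits your own reading of the numbers.

```java
public final class QueueStateSketch {
    enum State { MOSTLY_EMPTY, MOSTLY_FULL, BALANCED }

    // Hypothetical cut-off: failures above 1% of successful ops count as significant.
    static final double SIGNIFICANT = 0.01;

    // All four arguments are rates in the same unit (e.g. ops/us).
    static State classify(double offersMade, double offersFailed,
                          double pollsMade, double pollsFailed) {
        double offerFailRatio = offersFailed / offersMade;
        double pollFailRatio = pollsFailed / pollsMade;
        if (pollFailRatio > SIGNIFICANT && pollFailRatio >= offerFailRatio) {
            return State.MOSTLY_EMPTY; // consumer keeps finding nothing to poll
        }
        if (offerFailRatio > SIGNIFICANT) {
            return State.MOSTLY_FULL;  // producer keeps finding no room
        }
        return State.BALANCED;         // neither side fails often: streaming
    }

    public static void main(String[] args) {
        // Sample inputs only: lots of failed polls, no failed offers.
        System.out.println(classify(14.7, 0.0, 14.7, 12.6));
    }
}
```

Feeding it a result with many failed polls and next to no failed offers lands in MOSTLY_EMPTY, which is the "consumer is faster than the producer" case described above.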
A further challenge in managed runtime environments, which is unrelated to queues, is that garbage generating benchmarks will have GC state accumulate across measurement iterations. The implication is that each iteration is measuring from a different starting state. Naturally occurring GCs will leave the heap in varying states depending on the point at which they hit. We can choose to either embrace the noise in the measurement as an averaging of the cost/overhead of garbage or allocate a large enough heap to accommodate a single iteration worth of allocation and force a full GC per iteration, thus resetting the state per iteration. The benchmarks below were run with 8g heap and a GC cycle between iterations.
The benchmark I run here is the no backoff version of the throughput benchmark where failure to offer/poll makes no attempt at waiting/yielding/tapping of foot and just tries again straight away. This serves to maximize contention and is not a recipe for happiness in real applications.
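For a feel of what "no backoff" means, here is a rough sketch of the shape of such a loop. This is not the JCTools benchmark code and not JMH: the class and counter names are made up, and I use java.util.concurrent.ArrayBlockingQueue purely as a stand-in queue so the example runs with nothing but the JDK.

```java
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;

public final class NoBackoffSketch {
    /** Counters in the spirit of offersMade/offersFailed/pollsMade/pollsFailed. */
    static final class Counters {
        // Offer counters are only written by the producer, poll counters only
        // by the consumer. A real benchmark would pad these to avoid false sharing.
        long offersMade, offersFailed, pollsMade, pollsFailed;
    }

    static Counters run(Queue<Integer> queue, int messages) {
        Counters c = new Counters();
        Integer element = 42;
        Thread producer = new Thread(() -> {
            while (c.offersMade < messages) {
                if (queue.offer(element)) c.offersMade++;
                else c.offersFailed++;   // no waiting, no yielding: just try again
            }
        });
        Thread consumer = new Thread(() -> {
            while (c.pollsMade < messages) {
                if (queue.poll() != null) c.pollsMade++;
                else c.pollsFailed++;    // same story on the consumer side
            }
        });
        producer.start();
        consumer.start();
        try {
            producer.join();             // join makes the counters safely visible
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return c;
    }

    public static void main(String[] args) {
        Counters c = run(new ArrayBlockingQueue<>(128), 1_000_000);
        System.out.printf("offers made/failed: %d/%d, polls made/failed: %d/%d%n",
                c.offersMade, c.offersFailed, c.pollsMade, c.pollsFailed);
    }
}
```

The failed counters will differ from run to run and machine to machine, which is rather the point: they tell you which side was waiting on the other.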
JMH parameters common to all runs below:

-gc true -> GC cycle between iterations
-jvmArgs="-Xmx8g -Xms8g" -> 8g heap
-i 10 -r 1 -> 10 measurement iterations, 1 second each
-wi 20 -w 1 -> 20 warmup iterations, 1 second each
-f 5 -> five forks each to expose run to run variance

Throughput benchmark: baseline (JMH params: -bm thrpt -tu us)

Here are some baseline results; note the unit is ops/us, which is equal to millions of ops per second:
SpscArrayQueue (128k capacity)
offersFailed 0.005 ± 0.008 ops/us
offersMade 252.201 ± 1.649 ops/us
pollsFailed 0.009 ± 0.008 ops/us
pollsMade 252.129 ± 1.646 ops/us

So the SpscArrayQueue is offering great throughput, and seems pretty well balanced with failed offers/polls sort of cancelling out and low compared to the overall throughput.

SpscLinkedQueue

offersFailed ≈ 0 ops/us
offersMade 14.711 ± 5.897 ops/us
pollsFailed 12.624 ± 8.281 ops/us
pollsMade 14.710 ± 5.896 ops/us

For the SpscLinkedQueue we have no failed offers, since it's an unbounded queue. We do see a fair amount of failed polls. We expect the polls to be faster than the offers as offering pays for allocation of nodes on each element (24b overhead per element), while the poll simply leaves it to the GC to toss it all away.
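As a back of envelope on what that per element overhead means for the GC, here is a tiny sketch. It assumes "24b" means 24 bytes per node and that every successful offer allocates exactly one node; the class and method names are made up for illustration.

```java
public final class NodeChurnSketch {
    // Bytes of node garbage generated per second at a given throughput.
    static double bytesPerSecond(double opsPerMicrosecond, int overheadBytesPerElement) {
        double opsPerSecond = opsPerMicrosecond * 1_000_000;
        return opsPerSecond * overheadBytesPerElement;
    }

    public static void main(String[] args) {
        // offersMade from the SpscLinkedQueue baseline, 24 bytes per node (assumed).
        double bytes = bytesPerSecond(14.711, 24);
        System.out.printf("%.0f MB/s of node garbage%n", bytes / 1_000_000);
    }
}
```

At 14.711 ops/us that works out to roughly 353 MB of nodes per second, before counting the elements themselves.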

With this baseline we would expect the linked array queues' performance to be somewhere between the 2 data points above. Unlikely to hit the highs of the preallocated array queue, but hopefully much better than a linked queue.

Throughput benchmark: growable

So assuming we let it grow to 128k, how does the SpscGrowableArrayQueue perform in this benchmark and how much does the initial size impact the performance? CNK here is the initial buffer size. The buffer will double in size when offer fills up a buffer until we hit the max size buffer.

CNK Score Error Units
16 offersFailed 0.006 ± 0.006 ops/us
16 offersMade 183.720 ± 0.450 ops/us
16 pollsFailed 0.003 ± 0.001 ops/us
16 pollsMade 183.592 ± 0.450 ops/us
128 offersFailed 0.003 ± 0.006 ops/us
128 offersMade 184.236 ± 0.336 ops/us
128 pollsFailed 0.003 ± 0.001 ops/us
128 pollsMade 184.107 ± 0.336 ops/us
1K offersFailed 0.001 ± 0.003 ops/us
1K offersMade 183.113 ± 1.385 ops/us
1K pollsFailed 0.003 ± 0.001 ops/us
1K pollsMade 182.985 ± 1.385 ops/us
16K offersFailed 0.007 ± 0.006 ops/us
16K offersMade 181.388 ± 5.380 ops/us
16K pollsFailed 0.004 ± 0.001 ops/us
16K pollsMade 181.259 ± 5.380 ops/us

Under constant streaming pressure the Growable queue will keep growing until either a full sized buffer is allocated (very likely) or a smaller buffer in which the throughput is sustainable is found (unlikely for this benchmark as all it takes is a single spike). If the latter were the case we would see no failing offers, as the queue would still have room to grow. Either way we expect the transition to the last buffer to be a short phase, after which the algorithm is very similar to SpscArrayQueue and no further allocations happen. The number of resizing events is small, as the buffer doubles each time (so log2(capacity/initial size) resizes, e.g. for initial capacity 16k: 16k -> 32k -> 64k -> 128k).
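The doubling arithmetic is simple enough to sketch. This is an illustration only, with made-up names, and it assumes both the initial and the maximum size are powers of 2 with initial <= max.

```java
import java.util.ArrayList;
import java.util.List;

public final class GrowthSketch {
    // Every buffer size the queue passes through on its way from initial to max.
    static List<Integer> bufferSizes(int initial, int max) {
        List<Integer> sizes = new ArrayList<>();
        for (int size = initial; size < max; size *= 2) {
            sizes.add(size);
        }
        sizes.add(max);
        return sizes;
    }

    // Number of resize events, i.e. log2(max / initial).
    static int resizeEvents(int initial, int max) {
        return bufferSizes(initial, max).size() - 1;
    }

    public static void main(String[] args) {
        System.out.println(bufferSizes(16 * 1024, 128 * 1024));
        System.out.println(resizeEvents(16, 128 * 1024));
    }
}
```

Even starting from a 16 element buffer it only takes 13 doublings to reach 128k, so the resizing phase is a blip compared to the length of a measurement iteration.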
You may consider the slowdown from SpscArrayQueue large at roughly 25%, but I don't think it too bad considering that with the throughputs in question we are looking at costs in the single digit nanoseconds, where every extra instruction is going to show up (back of envelope: 250 ops/us -> ~4ns per offer/poll vs 180 ops/us -> ~5.5ns. 1ns ≈ 3 cycles ≈ 12 instructions or 1 L1 load).
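If you want to redo that envelope with your own numbers, the conversion is a one liner each way. A minimal sketch with made-up names; the 2.3GHz figure is the fixed frequency of the machine described in the setup.

```java
public final class CostSketch {
    // ops/us is millions of ops per second, so the cost of one op is 1000/x ns.
    static double nanosPerOp(double opsPerMicrosecond) {
        return 1000.0 / opsPerMicrosecond;
    }

    // At a fixed clock of `ghz` GHz, one nanosecond is `ghz` cycles.
    static double cycles(double nanos, double ghz) {
        return nanos * ghz;
    }

    public static void main(String[] args) {
        double array = nanosPerOp(252.201);     // SpscArrayQueue offersMade
        double growable = nanosPerOp(183.720);  // Growable, CNK=16, offersMade
        System.out.printf("%.2fns vs %.2fns, delta ~%.1f cycles at 2.3GHz%n",
                array, growable, cycles(growable - array, 2.3));
    }
}
```

With the measured scores the gap comes to about 1.5ns, or roughly 3.4 cycles at 2.3GHz. Mind that this is a throughput derived cost per operation, not a latency.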

Throughput benchmark: chunked

For Chunked we see the expected increase in throughput as we increase the chunk size (CNK is the fixed chunk size, the max size is 128K):

CNK Score Error Units
16 offersFailed ≈ 0 ops/us
16 offersMade 43.665 ± 0.892 ops/us
16 pollsFailed 9.160 ± 0.519 ops/us
16 pollsMade 43.665 ± 0.892 ops/us
128 offersFailed ≈ 10⁻⁴ ops/us
128 offersMade 151.473 ± 18.786 ops/us
128 pollsFailed 0.380 ± 0.331 ops/us
128 pollsMade 151.443 ± 18.778 ops/us
1K offersFailed 0.309 ± 0.375 ops/us
1K offersMade 149.351 ± 14.102 ops/us
1K pollsFailed 0.112 ± 0.125 ops/us
1K pollsMade 149.314 ± 14.120 ops/us
16K offersFailed ≈ 10⁻⁸ ops/us
16K offersMade 175.408 ± 1.563 ops/us
16K pollsFailed 0.038 ± 0.031 ops/us
16K pollsMade 175.394 ± 1.563 ops/us

Note the decline in throughput for smaller chunks is matched with an increase in poll failures, indicating that the consumer is becoming faster than the producer as the chunk grows smaller, requiring more frequent allocations by the producer.
Note also that even with 16 slot chunks this option is ~3 times faster than the linked alternative.
Under constant streaming pressure the Chunked queue will be pushed to its maximum size, which means the producer will be constantly allocating buffers. The producer resize conditions are also slightly trickier and require sampling of the consumer index. The consumer will be slowed down by this sampling, and also slowed down by jumping to new buffers. This problem will be worse as more resizing happens, which is a factor of chunk size.
The benefit of larger chunks will cap out at some point, you could explore this parameter to find the optimum.
An exercise to readers: run the benchmark with the JMH GC profiler and compare the queues. Use it to verify the assumption that Growable produces a bounded amount of garbage, while Chunked continues to churn.
Max throughput is slightly behind Growable.
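To put a rough number on "more frequent allocations" for small chunks, here is a sketch of the arithmetic. It assumes the producer allocates one fresh chunk for every chunk-size worth of elements offered, which is a simplification of mine rather than a statement about the JCTools implementation, and it reads "16K" as 16384.

```java
public final class ChunkChurnSketch {
    // Chunks allocated per second if every chunkSize offers need a new chunk.
    static double chunksPerSecond(double opsPerMicrosecond, int chunkSize) {
        return opsPerMicrosecond * 1_000_000 / chunkSize;
    }

    public static void main(String[] args) {
        System.out.printf("16 slot chunks:  %.0f chunks/s%n", chunksPerSecond(43.665, 16));
        System.out.printf("16K slot chunks: %.0f chunks/s%n", chunksPerSecond(175.408, 16 * 1024));
    }
}
```

Under that assumption the 16 slot configuration churns through roughly 2.7 million chunks per second, while the 16K one needs only about 10.7 thousand. The GC profiler exercise above is the way to check how close this gets to reality.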

The main takeaways for sizing here seem to me to be that tiny chunks are bad, but even with small/medium chunks you can have pretty decent throughput. The right size for your chunk should therefore depend on your expectations of average traffic on the one hand and the desirable size when empty on the other.

Throughput benchmark: unbounded

For unbounded we see the expected increase in throughput as we increase the chunk size (CNK is the chunk size, the max size is infinity and beyond):

CNK Score Error Units
16 offersFailed ≈ 0 ops/us
16 offersMade 56.315 ± 7.563 ops/us
16 pollsFailed 10.823 ± 1.611 ops/us
16 pollsMade 56.315 ± 7.563 ops/us
128 offersFailed ≈ 0 ops/us
128 offersMade 135.119 ± 23.306 ops/us
128 pollsFailed 1.236 ± 0.851 ops/us
128 pollsMade 131.770 ± 21.535 ops/us
1K offersFailed ≈ 0 ops/us
1K offersMade 182.922 ± 3.397 ops/us
1K pollsFailed 0.005 ± 0.003 ops/us
1K pollsMade 176.208 ± 3.221 ops/us
16K offersFailed ≈ 0 ops/us
16K offersMade 177.586 ± 2.929 ops/us
16K pollsFailed 0.031 ± 0.038 ops/us
16K pollsMade 176.884 ± 2.255 ops/us

The 16 chunk size is ~4 times faster than the linked list option, and it gets more efficient as the chunk size increases.
Max throughput is slightly behind growable.
Why is Chunked faster than Unbounded on 128 chunks, but slower on 1K? I've not looked into it, it's taken long enough to write this bloody post as it is. How about you check it out and let me know?

Throughput benchmark: summary

Growable queue performs well regardless of initial size for this case.
For chunked and unbounded the chunk size has definite implications on throughput. Having said that throughput is very good even for relatively small chunks.
Note that the results for the same benchmark without a GC cycle between iterations were very noisy. The above result intentionally removes the variance GC induces by forcing GC and allowing a large heap. The GC impact of linked array queues when churning will likely be in increasing old generation pressure as the overflow chunks are likely to have been promoted before they get collected. This is assuming a load where overflow is not that frequent and other allocation is present.

Interlude

Go ahead, grab a beer, or a coffee, a mojito perhaps(Norman/Viktor, go on), or maybe order a large Pan Galactic Gargle Blaster, you've earned it. I never thought you'd read this far, it's a tad dry innit? Well, it's not fun writing it either, but we're getting there, just need to look at one more benchmark...

Burst "cost"/latency benchmark: background and method

The burst cost benchmark is a more stable workload than the throughput one. The producer sends a burst of messages to a consumer. The consumer signals completion when the last message in the burst has arrived. The measurement runs from the sending of the first message to the arrival of the last message, as observed from the producer thread. It's a 'latency' benchmark, or rather an estimate of average communication cost via the particular queue. It's got bells on. It's a friend, and it's a companion, it's the only product you will ever need, follow these easy assembly instructions it never needs ironing.

This is, I think, a better evaluation of queue characteristics than the throughput benchmark for most applications. Queue starts empty, is hit with a burst of traffic and the burst is drained. The cost measured is inclusive of return signal latency, but as scenarios go this is not too far fetched. Calling this queue latency is a damn sight better than PRETENDING THE BLOODY INVERSE OF THROUGHPUT IS LATENCY. <deep breath>

Same machine and JMH parameters used as above. All the measurements below are average time per operation in nanoseconds. The benchmark code can be found here.
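If you'd rather see the shape of the thing than read the description, here is a bare bones sketch of a burst round trip. It is emphatically not the benchmark code: no JMH, no warmup, made-up names and message values, and java.util.concurrent.ConcurrentLinkedQueue stands in for the queue under test so it runs on a plain JDK.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public final class BurstCostSketch {
    static final int MESSAGE = 1; // ordinary element of a burst
    static final int LAST = 2;    // final element of a burst
    static final int STOP = 3;    // tells the consumer to exit

    /** Returns the cost in ns of each burst, as seen by the producer (caller). */
    static long[] measure(Queue<Integer> queue, int burstSize, int bursts) {
        AtomicInteger completed = new AtomicInteger();
        Thread consumer = new Thread(() -> {
            while (true) {
                Integer e = queue.poll();
                if (e == null) continue;                     // spin on empty
                if (e == STOP) return;
                if (e == LAST) completed.incrementAndGet();  // signal completion
            }
        });
        consumer.start();
        long[] costs = new long[bursts];
        for (int b = 0; b < bursts; b++) {
            long start = System.nanoTime();
            for (int i = 0; i < burstSize - 1; i++) send(queue, MESSAGE);
            send(queue, LAST);
            while (completed.get() != b + 1) { /* wait for the return signal */ }
            costs[b] = System.nanoTime() - start;  // includes the signal's trip back
        }
        send(queue, STOP);
        try {
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return costs;
    }

    private static void send(Queue<Integer> queue, int value) {
        while (!queue.offer(value)) { /* retry until accepted */ }
    }

    public static void main(String[] args) {
        long[] costs = measure(new ConcurrentLinkedQueue<>(), 100, 10_000);
        System.out.println("last burst cost: " + costs[costs.length - 1] + "ns");
    }
}
```

The queue starts each burst empty, which is the property that makes this workload steadier than the throughput one.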

Burst Cost benchmark: baseline

Testing first with SpscArrayQueue and SpscLinkedQueue to establish the expected baseline behaviour, BRST is the size of the burst:

SpscArrayQueue (128k capacity)

BRST Score Error Units
1 284.709 ± 8.813 ns/op
10 368.028 ± 6.949 ns/op
100 914.150 ± 11.424 ns/op

Right, sending one message has the overhead of cache coherency making data visible to another core. Sending 10/100 messages we can see the benefits of the SpscArrayQueue in allowing consumer and producer to minimize cache coherency overhead per element. We see a satisfying drop in cost per element as the burst size grows (the per element cost is the cost of the burst divided by the number of elements sent, so we see here: 1 -> 284, 10 -> 36, 100 -> 9), but this DOES NOT MEAN THE FRIGGIN' LATENCY IS BLOOMIN' DOWN TO 9ns WHEN WE SEND 100 MESSAGES.

SpscLinkedQueue

BRST Score Error Units
1 378.043 ± 7.536 ns/op
10 1675.589 ± 44.496 ns/op
100 17036.528 ± 492.875 ns/op

For the linked queue the per element overheads are larger, as is the cost of scanning through a linked list rather than an array as we poll data out. The gap between it and SpscArrayQueue widens as the burst size grows. In other words, the linked queue fails to make the most of the batching opportunity offered by slack in the queue.

Burst Cost benchmark: growable

We expect the growable queue to grow to accommodate the size of the burst. The eventual buffer size will be a tighter fit around the burst size, which in theory might be a benefit as the array is more likely to fit in cache. Let's spin the wheel (CNK is the initial chunk size, the max size is 128K):

BRST CNK Score Error Units
1 16 327.703 ± 11.485 ns/op
1 128 292.382 ± 9.807 ns/op
1 1K 275.573 ± 6.230 ns/op
1 16K 286.354 ± 6.980 ns/op
10 16 599.540 ± 73.376 ns/op
10 128 386.828 ± 10.016 ns/op
10 1K 376.295 ± 8.009 ns/op
10 16K 358.096 ± 6.107 ns/op
100 16 1173.644 ± 28.669 ns/op
100 128 1152.241 ± 40.067 ns/op
100 1K 966.612 ± 9.504 ns/op
100 16K 951.495 ± 12.425 ns/op

We have to understand the implementation to understand the results here, in particular:

The growable queue buffer will grow to accommodate the burst in a power of 2 sized array. This in particular means that when the burst size is 100 the buffer for the initially smaller 16 chunk queue is also 128. The delta between the 2 configurations becomes marginal once that happens as we see in the 100 burst which forces the initially size 16 element buffer to grow to 128.
The queue tries to probe ahead within a buffer to avoid reading on each element. The read ahead step is 25% of the buffer size. The smaller the buffer the more often we need to probe ahead (e.g. for a 16 element buffer we do this every 4 elements). This overhead is visible in the smaller buffers.
A burst which manages to fill more than 75% of the buffer will fail to read ahead with the long probe described above and fall back to reading a single element ahead. This implies that buffers that fit too snugly to the burst size will have worse performance.
When the buffers are sufficiently large the costs closely match the costs observed for the SpscArrayQueue. Yay!
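The probing behaviour in the points above is easier to see in a toy model. What follows is a single threaded simulation I made up for illustration, not the JCTools code: it pushes one burst into an empty power of 2 buffer with no consumer progress, uses a look ahead step of exactly 25% of the buffer, and counts how often the long probe succeeds versus how often the producer has to fall back to checking one slot at a time. Resizing is not modelled at all.

```java
public final class LookAheadSketch {
    // Smallest power of 2 >= v (for v >= 1), i.e. the buffer a burst would need.
    static int roundToPowerOfTwo(int v) {
        return 1 << (32 - Integer.numberOfLeadingZeros(v - 1));
    }

    /**
     * Simulates offering `burst` elements into an empty buffer of `capacity`
     * slots. Returns {successful long probes, single slot fallbacks}.
     */
    static int[] offerBurst(int burst, int capacity) {
        boolean[] occupied = new boolean[capacity];
        int mask = capacity - 1;
        int step = capacity / 4;          // look ahead step: 25% of the buffer
        long index = 0;                   // next slot the producer will write
        long limit = 0;                   // slots below this are known to be free
        int longProbes = 0;
        int fallbacks = 0;
        for (int i = 0; i < burst; i++) {
            if (index >= limit) {         // out of known free slots: probe
                if (!occupied[(int) ((index + step) & mask)]) {
                    limit = index + step; // long probe worked, `step` free slots
                    longProbes++;
                } else if (!occupied[(int) (index & mask)]) {
                    fallbacks++;          // long probe failed, single slot is free
                } else {
                    throw new IllegalStateException("buffer full");
                }
            }
            occupied[(int) (index & mask)] = true;
            index++;
        }
        return new int[] {longProbes, fallbacks};
    }

    public static void main(String[] args) {
        int[] snug = offerBurst(100, roundToPowerOfTwo(100)); // 128 slot buffer
        int[] roomy = offerBurst(100, 1024);
        System.out.println("snug:  " + snug[0] + " long probes, " + snug[1] + " fallbacks");
        System.out.println("roomy: " + roomy[0] + " long probes, " + roomy[1] + " fallbacks");
    }
}
```

In this model a burst of 100 into a 128 slot buffer gets 3 successful long probes and then 4 single slot fallbacks once it passes the 75% mark, while the same burst into a 1K buffer needs a single long probe. A burst of 10 into a 16 slot buffer probes 3 times, once every 4 elements.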

Burst Cost benchmark: chunked

For Chunked we see a slight increase in base cost and a bummer when the burst size exceeds the chunk size (CNK is the chunk size, the max size is 128K):

BRST CNK Score Error Units
1 16 311.743 ± 11.613 ns/op
1 128 295.987 ± 5.468 ns/op
1 1K 281.308 ± 8.381 ns/op
1 16K 281.962 ± 7.376 ns/op
10 16 478.687 ± 52.547 ns/op
10 128 390.041 ± 16.029 ns/op
10 1K 371.067 ± 7.789 ns/op
10 16K 386.683 ± 5.276 ns/op
100 16 2513.226 ± 38.285 ns/op
100 128 1117.990 ± 14.252 ns/op
100 1K 969.435 ± 10.072 ns/op
100 16K 939.010 ± 8.173 ns/op

Results are overall similar to the growable, what stands out is:

If the chunk is too small to accommodate the burst we see a large increase in cost. Still, comparing this to the SpscLinkedQueue shows a significant benefit. Comparing to the growable version we see the sense in perhaps letting the queue grow to a better size as a response to bursts.
If the chunk is large enough to accommodate the burst, behaviour closely matches SpscGrowableArrayQueue. Yay!

Burst Cost benchmark: unbounded

Final one, just hang in there.

BRST CNK Score Error Units
1 16 303.030 ± 11.812 ns/op
1 128 308.158 ± 11.064 ns/op
1 1K 286.379 ± 6.027 ns/op
1 16K 282.574 ± 10.886 ns/op
10 16 554.285 ± 54.468 ns/op
10 128 407.350 ± 11.227 ns/op
10 1K 379.716 ± 9.357 ns/op
10 16K 370.885 ± 12.068 ns/op
100 16 2748.900 ± 64.321 ns/op
100 128 1150.393 ± 26.355 ns/op
100 1K 1005.036 ± 14.491 ns/op
100 16K 979.372 ± 13.369 ns/op

What stands out is:

If the chunk is too small to accommodate the burst we see a large increase in cost. Still, comparing this to the SpscLinkedQueue shows a significant benefit.
If the chunk is large enough to accommodate the burst and make the most of probing ahead the costs closely resemble the SpscArrayQueue for larger bursts. Yay!

Burst Cost benchmark: summary

We see a pretty much expected result for these queues, which is to say that on the fast path they are the same and therefore if the fast path dominates they show the same costs as a plain SpscArrayQueue, which is good news. When chunks are too small and we have to allocate new chunks we start to see overheads.

A more subtle observation here is that smaller buffers have some drawbacks as the slow path of the producer code is more likely to be executed. This reflects correctly the empty queue assumption that the JCTools queues rely on, but broken assumptions are... well... broken, so the cost goes up.

A further consideration here for smaller buffers is the hot/cold structure of the code. It is intended that the producer code inlines the "offer" hot path, but as the cold path is rarely run the compiler will fail to inline it. This is an intentional inlining fail. Inlining the cold path would make "offer" larger and a lot more complex, making the compiler's job harder, and may result in worse generated code. When we run with burst/buffer sizes which systematically violate the hot/cold assumption we can trigger a bad inlining decision. This can be worked around by marking the cold methods as "dontinline" using the CompileCommand option or the compiler oracle file.

Mmmm... this is boring :(

Yes... Nothing too surprising happened here, I did not emerge from the lab with my coat on fire, these things happen. One anecdote worth sharing here is that I originally ran the benchmarks with only 2 threads allocated to the JVM. This resulted in noisier measurement as I effectively under-provisioned the JVM with CPUs for compilation/GC or any OS scheduling contention/interrupts. When running on a 2 core laptop this is a reasonable compromise to fix the cross core topology of the benchmark, but on a server class machine it is easy enough to provision the same topology with more CPUs.

Next part will feature the astounding extension of these queues to the MPSC domain and will be far more interesting! I promise.

Tuesday, 13 December 2016