Tuesday 20 December 2016

What do Atomic*::lazySet/Atomic*FieldUpdater::lazySet/Unsafe::putOrdered* actually mean?

Paved with well intended definitions it is.
lazySet/putOrdered (or an ordered store) was added as a bit of a rushed/non-commital afterthought after the JMM was done, so it's description is subject to many debates on the mailing lists, stack overflow and watercoolers the world over. This post merely tries to provide a clear definition with references to relevant/reputable sources.

Definition:
An ordered store is a weakened volatile store[2][5]. It prevents preceding stores and loads from being reordered with the store[1][3][6], but does not prevent subsequent stores and loads from being reordered with it[2][4].
If there was a JMM cookbook entry for ordered store defining it with barriers in mind it would seem that the consensus is that ordered stores are preceded by a StoreStore AND a LoadStore barrier[4][6].

Ordered store is practically the same as a C++ memory_release_store[5][7].


References:
[1] Original bug:
"lazySet provides a preceding store-store barrier (which is either a no-op or very cheap on current platforms), but no store-load barrier"
See here: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6275329

[2] java.util.concurrent docs:
"lazySet has the memory effects of writing (assigning) a volatile variable except that it permits reorderings with subsequent (but not previous) memory actions that do not themselves impose reordering constraints with ordinary non-volatile writes."
See here: https://docs.oracle.com/javase/8/docs/api/?java/util/concurrent/package-summary.html

[3] JMM cookbook: Defining barriers meaning here: http://g.oswego.edu/dl/jmm/cookbook.html

[4] concurrency-interest Q&A with Doug Lea, October 2011:
"[Ruslan]:... If it happens (== we see spin-wait loop finished) -- does it mean,that all writes preceding lazySet are also done, committed, and visible to thread 2, which finished spin-wait loop?
[Doug]: Yes, although technically, you cannot show this by reference to the Synchronization Order in the current JLS.
...
lazySet basically has the properties of a TSO store"
See here: http://cs.oswego.edu/pipermail/concurrency-interest/2011-October/008296.html

The discussion is REALLY worth reading, involving Hans, Doug, Vitaly, Ruslan and other such respectable members of this excellent mailing list. Go on, I'll wait.

The discussion on that thread concludes the following is required:
...
LoadStore + StoreStore
st [Y],X // store X into memory address Y
...

Outcome: Stores before and after are now prevented from floating across the barrier. Loads before the barrier are also prevented from floating down. Later loads are free to float up. Note that st may in theory be delayed indefinitely, certainly other loads and stores are allowed to float up between it and the barrier.

[5] concurrency-interest Q&A with Aleksey Shipilev, May 2016:
"putOrdered is a release in disguise, most of the C++11 std::atomic(...,
mem_order_release) reasoning applies here."
"acquire/release are the relaxations from the usual volatile
rules -- while producing happens-before-s, they drop from total
synchronization order, thus breaking sequential consistency."

And adds some fine examples:
"Safe publication still works:

                       int x; volatile int y;
-----------------------------------------------------------------------
    put(x, 1);                   |  r1 = get{Acquire|Volatile}(y);
    put{Release|Volatile}(y, 2); |  r2 = get(x);

(r1, r2) = (2, 0) is forbidden.

But anything trickier that requires sequential consistency fails. IRIW
fails, because no consistent write order observed by all threads. Dekker
fails, because release stores followed by loads may or may not be
visible in program order:

                     volatile int x; volatile int y;
-----------------------------------------------------------------------
    putRelease(x, 1);            |    putRelease(y, 1);
    r1 = getAcquire(y);          |    r2 = getAcquire(x);

(r1, r2) = (0, 0) is allowed. Forbidden if all ops are volatile.


Safe construction still does not work (even for volatiles!):

                                A global;
-----------------------------------------------------------------------
    A a = <alloc>;                  |  A a = global;
    put{Release|Volatile}(a.x, 1);  |  r1 = get{Acquire|Volatile}(a.x);
    global = a;                     |

(r1) = (0) is allowed."
See here: http://cs.oswego.edu/pipermail/concurrency-interest/2016-May/015104.html

[6] concurrency-interest Q&A with Aleksey Shipilev, March 2016:
"> int a, b;
> 
> boolean tryLock() {
>     UNSAFE.putOrdered(a, 1); // Declare intent.
> 
>     // No StoreLoad here as store is not volatile.
> 
>     if (UNSAFE.getVolatile(b) == 1)) {
>         // Reset intent and return false;
>     }
> 
>     return true;
> }

Even in the naive barrier interpretation that usually gives stronger
answers, you have:

 [LoadStore|StoreStore]
 a = 1;

 r1 = b;
 [LoadLoad|LoadStore]"
See here: http://cs.oswego.edu/pipermail/concurrency-interest/2016-March/015037.html

[7] C++ memory_order_release definition:
"A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable (see Release-Acquire ordering below) and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic (see Release-Consume ordering below)."
See here: http://en.cppreference.com/w/cpp/atomic/memory_order

Many thanks to A. Shipilev, M. Thompson, D. Lawrie and C. Ruslan  for reviewing, any remaining errors are their own and they shall be most severely reprimanded for them.

Tuesday 13 December 2016

Linked Array Queues, part 2: SPSC Benchmarks

JCTools has a bunch of benchmarks we use to stress test the queues and evaluate optimizations.
These are of course not 'real' workloads, but serve to highlight imperfections and opportunities. While it is true that an optimization might work in a benchmark but not in the real world, a benchmark can work as a demonstration that there are at least circumstances in which it does work. All measurement is imperfect, but not as imperfect as claims made with no fucking evidence whatsoever, so here goes.
How do these linked-array queues fare in the benchmarks? what can we learn here?
The linked array queues are a hybrid of the array and linked queues. So it seems reasonable that we should compare them to both SpscArrayQueue and SpscLinkedQueue. We should also consider how the queues differ and see if we can flush out the differences via the benchmarks.
If you crack under the pressure of boring details, skip to the summary, do not stop at interlude, do not collect a cool drink or get praise, just be on yer fuckin' merry way.

Setup:

Benchmarks are run on a quiet server class machine:
  • Xeon processor(Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz): 2 CPUs x 12 cores x 2 threads (HT)
  • CentOS
  • Oracle JDK8u101
  • All benchmarks are run taskset to cores on the same numa node, but such that threads cannot share the same physical core.
  • Turbo boost is off, the scaling governor is userspace and the frequency is fixed.
  • The code is on github

Throughput benchmark: background and method

A throughput benchmark for queues is a tricky fucker. In particular the results change meaning depending on the balance between consumer and producer:
  • If the consumer is faster than the producer we are measuring empty queue contention (producer/consumer hitting the same cache line for elements in the queue, perhaps sampling each other index). Empty queues are the expected state for responsive applications.
  • If the producer is faster than the consumer we are measuring full queue contention, which may have similar issues. For some queues which optimize for the healthy assumption that queues are mostly empty this may be a particularly bad place to be.
  • If the producer and consumer are well balanced we are testing a streaming use case which offers the most opportunities for progress for both consumer and producer. This should yield the best performance, but for most applications may not be a realistic scenario at all.
The JCTools throughput benchmark does not resolve these issues. It does however report results which give us an idea of poll/offer failure rates which are in turn indicative of which state we find ourselves in.
A further challenge in managed runtime environments, which is unrelated to queues, is that garbage generating benchmarks will have GC state accumulate across measurement iterations. The implication is that each iteration is measuring from a different starting state. Naturally occurring GCs will leave the heap in varying states depending on the point at which they hit. We can choose to either embrace the noise in the measurement as an averaging of the cost/overhead of garbage or allocate a large enough heap to accommodate a single iteration worth of allocation and force a full GC per iteration, thus resetting the state per iteration. The benchmarks below were run with 8g heap and a GC cycle between iterations.
The benchmark I run here is the no backoff version of the throughput benchmark where failure to offer/poll makes no attempt at waiting/yielding/tapping of foot and just tries again straight away. This serves to maximize contention and is not a recipe for happiness in real applications.
JMH parameters common to all runs below:
  • -gc true -> GC cycle between iterations
  • -jvmArgs="-Xmx8g -Xms8g" -> 8g heap
  • -i 10  -r 1 -> 10 measurement iterations, 1 second each
  • -wi 20 -w 1 -> 20 warmup iterations, 1 second each
  • -f 5 -> five forks each to expose run to run variance

Throughput benchmark: baseline(JMH params: -bm thrpt -tu us)

Here's some baseline results, note the unit is ops/us equal to millions of ops per second:
SpscArrayQueue (128k capacity)
offersFailed   0.005 ±  0.008  ops/us
offersMade   252.201 ±  1.649  ops/us
pollsFailed    0.009 ±  0.008  ops/us
pollsMade    252.129 ±  1.646  ops/us

So the SpscArrayQueue is offering great throughput, and seems pretty well balanced with failed offers/polls sort of cancelling out and low compared to the overall throughput.

SpscLinkedQueue
offersFailed     ≈ 0           ops/us
offersMade    14.711 ±  5.897  ops/us
pollsFailed   12.624 ±  8.281  ops/us
pollsMade     14.710 ±  5.896  ops/us

For the SpscLinkedQueue we have no failed offers, since it's an unbounded queue. We do see a fair amount of failed polls. We expect the polls to be faster than the offers as offering pays for allocation of nodes on each element (24b overhead per element), while the poll simply leaves it to the GC to toss it all away.
With this baseline we would expect linked arrays queues performance to be somewhere between the 2 data points above. Unlikely to hit the highs of the preallocated array queue, but hopefully much better than a linked queue.

Throughput benchmark: growable

So assuming we let it grow to 128k, how does the SpscGrowableArrayQueue perform in this benchmark and how much does the initial size impact the performance? CNK here is the initial buffer size. The buffer will double in size when offer fills up a buffer until we hit the max size buffer.
 CNK                 Score    Error   Units
  16 offersFailed    0.006 ±  0.006  ops/us
  16 offersMade    183.720 ±  0.450  ops/us
  16 pollsFailed     0.003 ±  0.001  ops/us
  16 pollsMade     183.592 ±  0.450  ops/us
 128 offersFailed    0.003 ±  0.006  ops/us
 128 offersMade    184.236 ±  0.336  ops/us
 128 pollsFailed     0.003 ±  0.001  ops/us
 128 pollsMade     184.107 ±  0.336  ops/us
  1K offersFailed    0.001 ±  0.003  ops/us
  1K offersMade    183.113 ±  1.385  ops/us
  1K pollsFailed     0.003 ±  0.001  ops/us
  1K pollsMade     182.985 ±  1.385  ops/us
 16K offersFailed    0.007 ±  0.006  ops/us
 16K offersMade    181.388 ±  5.380  ops/us
 16K pollsFailed     0.004 ±  0.001  ops/us
 16K pollsMade     181.259 ±  5.380  ops/us

  • Under constant streaming pressure the Growable queue will keep growing until either full sized buffer is allocated (very likely) or a smaller buffer in which the throughput is sustainable is found (unlikely for this benchmark as all it takes is a single spike). If that was the case we would have no failing offers. Either way we expect transition to the last buffer to be a short phase after which the algorithm is very similar to SpscArrayQueue and no further allocations happen. The number of resizing events is small, as the buffer doubles each time (so log2(capacity/initial size), e.g. for initial capacity 16k: 16k -> 32k -> 64k -> 128k).
  • You may consider the slow down from SpscArrayQueue large at roughly 25%, but I don't think it too bad considering that with the throughputs in question we are looking at costs in the single digit nanoseconds where every extra instruction is going to show up (back of envelope: 250 ops/us -> ~4ns per offer/poll vs 180 ops/us -> ~5ns. 1ns = ~3 cycle ~= 12 instructions or 1 L1 load).

Throughput benchmark: chunked

For Chunked we see the expected increase in throughput as we increase the chunk size (CNK is the fixed chunk size, the max size is 128K):
 CNK                 Score    Error   Units
  16 offersFailed      ≈ 0           ops/us
  16 offersMade     43.665 ±  0.892  ops/us
  16 pollsFailed     9.160 ±  0.519  ops/us
  16 pollsMade      43.665 ±  0.892  ops/us
 128 offersFailed   ≈ 10⁻⁴           ops/us
 128 offersMade    151.473 ± 18.786  ops/us
 128 pollsFailed     0.380 ±  0.331  ops/us
 128 pollsMade     151.443 ± 18.778  ops/us
  1K offersFailed    0.309 ±  0.375  ops/us
  1K offersMade    149.351 ± 14.102  ops/us
  1K pollsFailed     0.112 ±  0.125  ops/us
  1K pollsMade     149.314 ± 14.120  ops/us
 16K offersFailed   ≈ 10⁻⁸           ops/us
 16K offersMade    175.408 ±  1.563  ops/us
 16K pollsFailed     0.038 ±  0.031  ops/us
 16K pollsMade     175.394 ±  1.563  ops/us

  • Note the decline in throughput for smaller chunks is matched with an increase in poll failures indicating that the consumer is becoming faster than the producer as the chunk grows smaller requiring more frequent allocations by the produce.
  • Note also that even with 16 slot chunks this option is ~3 times faster than the linked alternative.
  • Under constant streaming pressure the Chunked queue will be pushed to it's maximum size, which means the producer will be constantly allocating buffers. The producer resize conditions are also slightly trickier and require sampling of the consumer index. The consumer will be slowed down by this sampling, and also slowed down by jumping to new buffers. This problem will be worse as more resizing happens, which is a factor of chunk size.
  • The benefit of larger chunks will cap out at some point, you could explore this parameter to find the optimum.
  • An exercise to readers: run the benchmark with the JMH GC profiler and compare the queues. Use it to verify the assumption that Growable produces a bounded amount of garbage, while Chunked continues to churn.
  • Max throughput is slightly behind Growable.
The main take aways for sizing here seem to me that tiny chunks are bad, but even with small/medium chunks you can have pretty decent throughput. The right size for your chunk should therefore depend on your expectations of average traffic on the one hand and desirable size when empty.

Throughput benchmark: unbounded

For unbounded we see the expected increase in throughput as we increase the chunk size  (CNK is the chunk size, the max size is infinity and beyond):
 CNK                 Score    Error   Units
  16 offersFailed      ≈ 0           ops/us
  16 offersMade     56.315 ±  7.563  ops/us
  16 pollsFailed    10.823 ±  1.611  ops/us
  16 pollsMade      56.315 ±  7.563  ops/us
 128 offersFailed      ≈ 0           ops/us
 128 offersMade    135.119 ± 23.306  ops/us
 128 pollsFailed     1.236 ±  0.851  ops/us
 128 pollsMade     131.770 ± 21.535  ops/us
  1K offersFailed      ≈ 0           ops/us
  1K offersMade    182.922 ±  3.397  ops/us
  1K pollsFailed     0.005 ±  0.003  ops/us
  1K pollsMade     176.208 ±  3.221  ops/us
 16K offersFailed      ≈ 0           ops/us
 16K offersMade    177.586 ±  2.929  ops/us
 16K pollsFailed     0.031 ±  0.038  ops/us
 16K pollsMade     176.884 ±  2.255  ops/us

  • The 16 chunk size is ~4 times faster than the linked list option, as chunk size increases it gets more efficient.
  • Max throughput is slightly behind growable.
  • Why is Chunked faster than Unbounded on 128 chunks, but slower on 1K? I've not looked into it, it's taken long enough to write this bloody post as it is. How about you check it out and let me know?

Throughput benchmark: summary

  • Growable queue performs well regardless of initial size for this case.
  • For chunked and unbounded the chunk size has definite implications on throughput. Having said that throughput is very good even for relatively small chunks. 
  • Note that the results for the same benchmark without a GC cycle between iterations were very noisy. The above result intentionally removes the variance GC induces by forcing GC and allowing a large heap. The GC impact of linked array queues when churning will likely be in increasing old generation pressure as the overflow chunks are likely to have been promoted before they get collected. This is assuming a load where overflow is not that frequent and other allocation is present.

Interlude

Go ahead, grab a beer, or a coffee, a mojito perhaps(Norman/Viktor, go on), or maybe order a large Pan Galactic Gargle Blaster, you've earned it. I never thought you'd read this far, it's a tad dry innit? Well, it's not fun writing it either, but we're getting there, just need to look at one more benchmark...

Burst "cost"/latency benchmark: background and method

The burst cost benchmark is a more stable workload than the throughput one. The producer sends a burst of messages to a consumer. The consumer signals completion when the last message in the burst has arrived. The measurement is from first message sent and arrival of last message observed from the producer thread. It's a 'latency' benchmark, or rather an estimate of average communication cost via the particular thread. It's got bells on. It's a friend, and it's a companion, it's the only product you will ever need, follow these easy assembly instructions it never needs ironing.
This is, I think, a better evaluation of queue characteristics than the throughput benchmark for most applications. Queue starts empty, is hit with a burst of traffic and the burst is drained. The cost measured is inclusive of return signal latency, but as scenarios go this is not too far fetched. Calling this queue latency is a damn sight better than PRETENDING THE BLOODY INVERSE OF THROUGHPUT IS LATENCY. <deep breath>
Same machine and JMH parameters used as above. All the measurements below are average time per operation in nanoseconds. The benchmark code can be found here.

Burst Cost benchmark: baseline

Testing first with SpscArrayQueue and SpscLinkedQueue to establish the expected baseline behaviour, BRST is the size of the burst:
SpscArrayQueue (128k capacity)
BRST      Score     Error  Units
  1     284.709 ±   8.813  ns/op
 10     368.028 ±   6.949  ns/op
100     914.150 ±  11.424  ns/op

Right, sending one message has the overhead of cache coherency making data visible to another core. Sending 10/100 messages we can see the benefits of the SpscArrayQueue in allowing consumer and producer to minimize cache coherency overhead per element. We see a satisfying drop in cost per element as the burst size grows (the per element cost is the cost of the burst divided by the number of elements sent, so we see here: 1 -> 284, 10 -> 36, 100 -> 9), but this DOES NOT MEAN THE FRIGGIN' LATENCY IS BLOOMIN' DOWN TO 9ns WHEN WE SEND 100 MESSAGES.

SpscLinkedQueue
BRST      Score     Error  Units
  1     378.043 ±   7.536  ns/op
 10    1675.589 ±  44.496  ns/op
100   17036.528 ± 492.875  ns/op

For the linked queue the per element overheads are larger, as well as the cost of scanning through a linked list rather than an array as we poll data out. The gap between the it and SpscArrayQueue widens as the burst size grows. The linked queue fails to make the most of the batching opportunity offered by slack in the queue in other words.

Burst Cost benchmark: growable

We expect the growable queue to grow to accommodate the size of the burst. The eventual buffer size will be a tighter fit around the burst size, which in theory might be a benefit as the array is more likely to fit in cache. Let's spin the wheel (CNK is the initial chunk size, the max size is 128K):
BRST  CNK    Score    Error  Units
  1    16  327.703 ± 11.485  ns/op
  1   128  292.382 ±  9.807  ns/op
  1    1K  275.573 ±  6.230  ns/op
  1   16K  286.354 ±  6.980  ns/op
 10    16  599.540 ± 73.376  ns/op
 10   128  386.828 ± 10.016  ns/op
 10    1K  376.295 ±  8.009  ns/op
 10   16K  358.096 ±  6.107  ns/op
100    16 1173.644 ± 28.669  ns/op
100   128 1152.241 ± 40.067  ns/op
100    1K  966.612 ±  9.504  ns/op
100   16K  951.495 ± 12.425  ns/op

We have to understand the implementation to understand the results here, in particular:
  • The growable queue buffer will grow to accommodate the burst in a power of 2 sized array. This in particular means that when the burst size is 100 the buffer for the initially smaller 16 chunk queue is also 128. The delta between the 2 configurations becomes marginal once that happens as we see in the 100 burst which forces the initially size 16 element buffer to grow to 128.
  • The queue tries to probe ahead within a buffer to avoid reading on each element.The read ahead step is a 25% of the buffer size. The smaller the buffer the more often we need to probe ahead (e.g. for a 16 element buffer we do this every 4 elements). This overhead is visible in the smaller buffers.
  • A burst which manages to fill more than 75% will fail to read ahead with the long probe described above and fall back to reading a single element ahead. This implies that buffers that fit too snugly to the burst size will have worse performance.
  • When the buffers are sufficiently large the costs closely match the costs observed for the SpscArrayQueue. Yay!

Burst Cost benchmark: chunked

For Chunked we see a slight increase in base cost and a bummer when the burst size exceeds the chunk size (CNK is the chunk size, the max size is 128K):
BRST  CNK    Score    Error  Units
  1    16  311.743 ± 11.613  ns/op
  1   128  295.987 ±  5.468  ns/op
  1    1K  281.308 ±  8.381  ns/op
  1   16K  281.962 ±  7.376  ns/op
 10    16  478.687 ± 52.547  ns/op
 10   128  390.041 ± 16.029  ns/op
 10    1K  371.067 ±  7.789  ns/op
 10   16K  386.683 ±  5.276  ns/op
100    16 2513.226 ± 38.285  ns/op
100   128 1117.990 ± 14.252  ns/op
100    1K  969.435 ± 10.072  ns/op
100   16K  939.010 ±  8.173  ns/op

Results are overall similar to the growable, what stands out is:
  • If the chunk is too little to accommodate the burst we see a large increase to cost. Still, comparing this to the SpscLinkedQueue shows a significant benefit. Comparing to the growable version we see the sense in perhaps letting the queue grow to a better size as a response to bursts.
  • If the chunk is large enough to accommodate the burst behaviour closely matches SpscGrowableArrayQueue. Yay!

Burst Cost benchmark: unbounded

Final one, just hang in there. 
BRST  CNK    Score    Error  Units
  1    16  303.030 ± 11.812  ns/op
  1   128  308.158 ± 11.064  ns/op
  1    1K  286.379 ±  6.027  ns/op
  1   16K  282.574 ± 10.886  ns/op
 10    16  554.285 ± 54.468  ns/op
 10   128  407.350 ± 11.227  ns/op
 10    1K  379.716 ±  9.357  ns/op
 10   16K  370.885 ± 12.068  ns/op
100    16 2748.900 ± 64.321  ns/op
100   128 1150.393 ± 26.355  ns/op
100    1K 1005.036 ± 14.491  ns/op
100   16K  979.372 ± 13.369  ns/op

What stands out is:
  • If the chunk is too little to accommodate the burst we see a large increase to cost. Still, comparing this to the SpscLinkedQueue shows a significant benefit.
  • If the chunk is large enough to accommodate the burst and make the most of probing ahead the costs closely resemble the SpscArrayQueue for larger bursts. Yay!

Burst Cost benchmark: summary

We see a pretty much expected result for these queues, which is to say that on the fast path they are the same and therefore if the fast path dominates they show the same costs as a plain SpscArrayQueue, which is good news. When chunks are too small and we have to allocate new chunks we start to see overheads.
A more subtle observation here is that smaller buffers have some drawbacks as the slow path of the producer code is more likely to be executed. This reflects correctly the empty queue assumption that the JCTools queues rely on, but broken assumptions are... well... broken, so the cost goes up.
A further consideration here for smaller buffer is the hot/cold structure of the code. It is intended that the producer code inlines the "offer" hot path, but as the cold path is rarely run it will fail to inline it. This is an intentional inlining fail. Inlining the cold path will make the "offer" larger and allot more complex, making the compilers job harder and may result in worse resulting code. When we run with burst/buffer sizes which systematically violate the hot/cold assumption we can trigger a bad inlining decision. This can be worked around by marking the cold methods as "dontinline" using the CompileCommand option or the compiler oracle file.

Mmmm... this is boring :(

Yes... Nothing too surprising happened here, I did not emerge from the lab with my coat on fire, these things happen. One anecdote worth sharing here is that I originally run the benchmarks with only 2 threads allocated to the JVM, this resulted in noisier measurement as I effectively under provisioned the JVM with CPUs for compilation/GC or any OS scheduling contention/interrupts. When running on a 2 core laptop this is a reasonable compromise to fix the cross core topology of the benchmark, but on a server class machine it is easy enough to provision the same topology with more CPUs.
Next part will feature the astounding extension of these queues to the MPSC domain and will be far more interesting! I promise.

    Monday 24 October 2016

    Linked Array Queues, part 1: SPSC

    When considering concurrent queues people often go for either:
    1. An array backed queue (circular array/ring buffer etc.)
    2. A linked list queue
    The trade off in the Java world seems to be that array backed queues offer better throughput, but are always bounded and allocated upfront, and linked queues offer smaller footprint when empty, but worse throughput and higher overhead per element when full. Linked queues are also often unbounded.
    In JCTools we have the above duality, but in later versions introduced a hybrid queue which attempts to offer a middle ground between the 2 extremes above. This hybrid is the linked array queue:
    1. Queue is backed by one or more arrays of references.
    2. As long as the queue can fit in a single array no further arrays are allocated.
    3. When empty the queue can naturally shrink to a single backing array.
    For the SPSC case this has already been done in C++ Fast Flow with their uSPSC queues. In Java there are no other implementations that I know of (please let me know if I missed anything).
    In implementing these queues I have relied heavily on the feedback and support of @akarnokd@pcholakov and others who contributed fixes, filed bugs, and so forth. Thanks guys!
    3 variations on linked array queues have been explored in JCTools:
    1. Chunked: Each backing array is a fixed chunk size. The queue is bounded.
    2. Unbounded: Each backing array is a fixed chunk size. The queue is unbounded.
    3. Growable: Chunk size doubles every time until the full blown backing array is used. The queue is bounded.

    Hot Path offer/poll

    The queues all share the same polling logic and on the fast path share the same offer logic as well:
    If you've not read JCTools code before, or maybe you've forgotten, here's the convention:
    • sv/lvXXX: Store/Load Volatile, same as a volatile field write/read
    • sp/lpXXX: Store/Load Plain, same as a normal field write/read
    • soXXX: Store Ordered, same as an AtomicXXX.lazySet
    This code will need reconsidering in a post-JDK9 world, some other time perhaps.
    So what's the deal:
    • As long as we are not passed the producerLimit, keep writing.
      • If we have passed it go to slow path (where the money be)
    • As long as there's stuff in the queue, read it.
      • Unless it's JUMP, in which case read through to next array.
    The general idea here being that the common case is small and simple. This should have the following effects:
    1. offer/poll code is small enough to inline when hot.
    2. offerColdPath/newBufferPoll are set up to either not inline or, when inlined be out of band code blocks. This should keep size on the hot path small and help the compiler focus on more important things.
    3. offer/poll should perform similar to the SpscArrayQueue in the common case.
    NOTE: The producer publishes element then index, but the consumer does the reverse. This ensures a consistent view on the consumer side where the following assertion must hold:
      !queue.isEmpty() => queue.poll() != null

    NOTE: In some early versions the new array was entered instead of the JUMP constant marker. This required an instanceof check for each element loaded and a compromise to either not allow Object[] to be an element of the queue or introduce some wrapper class. Comparing to a constant turned out to be much simpler and faster.

    Cold Path poll

    The consumer has hit a JUMP, which indicates the producer has linked a new array to this one. The new array is the last element of the current array. We go to newBufferPoll:

    The consumer therefore has to:
    1. Load new array from the last element of old array.
    2. Null out reference from old to new (prevent nepotism).
    3. Adjust consumer view on buffer and mask.
    4. Poll (remove element and progress counter) from new buffer.
    Note that peek similarly refreshes view of buffer on hitting the JUMP marker. This goes against the standard spirit on peek which is a view only operation.

    Cold Path offer: Unbounded

    This method is implemented differently for each of the use cases, unbounded is the simplest:
    In the unbounded case we care only about our ability to make progress inside the current buffer:

    1. Probe inside buffer to see if we have 'many' elements to go before full. If buffer is mostly empty (this is the case for most applications most of the time), than we have successfully saved ourselves loading elements from the queue before writing in. A successful probe will update the producer limit and write to the queue.
    2. Probe failed, we check if we can write a single element. Remember we always need one free slot to JUMP with, so we look at the slot following the one we are on. If the next slot is empty we write to the current one, but we don't update the limit.
    3. A single slot remains in the buffer. We allocate a new buffer and link to it.
    This is not a large departure from the SpscArrayQueue, I leave the comparison as an exercise to the delightful reader.

    Cold Path offer: Chunked

    With chunked linked array queues we have a fixed chunk size, but an overall bound on the size. This complicates matters because on top of the buffer level limits put on the producer we must also consider the queue level limitation. In particular there might be space available in the current producer buffer, while the queue is in fact full. Here's the implementation:
    Similar to the above but the difference lies in negotiating the buffer vs. queue limit.

    Cold Path offer: Growable

    The growable queue is similar in spirit to an ArrayList as it doubles it's backing array capacity when a buffer is full. This adds an interesting bit of information to the game, since:
    1. If we are not on the last buffer, we've not hit the queue limit,
    2. If we're on the last buffer, and so is the consumer, we can revert to just checking for null in the buffer.
    3. If we're on the last buffer, and the consumer isn't, we need to hang tight and let it pass. It's a temporary state.
    The code for handling this is rather ugly and long, but since you've put up with the rest:
    The lookAheadStep is dynamically adjusted as the buffer grows, and also acts as an indicator for the transition period which the producer is on the last buffer and the consumer is trailing. It's a mess to look at, but sadly performs better than a neater alternative which builds on the Chunked variant... General idea:

    1. lookAheadStep is positive => we are either not on last buffer, or on  it for both consumer and producer => it is enough to consider the elements in the producer buffer to determine if the queue is full. In particular if the buffer is full then we must resize unless we are on the last buffer in which case we are full. Note that we allow using the last buffer to the last element, since we don't need a JUMP anymore.
    2. lookAheadStep is negative => we are waiting for consumer to get to the last buffer. We use the lookAheadStep to indicate how far behind the consumer is.
    It's not complex, just messy, and if you got an elegant representation please ping me with your suggestions.

    Performance?

    GODDAMN it! this is the time consuming part! I've benchmarked on a few setups, but not kept track or clear records. I'll need to do it all again, might as well be another post since nobody reads this far into these things anyhow. La la la la la, performance is great, la la la la la, if I look now I might be disappointed, la la la la la, this is better than benchmarking, la la la la la.

    TO BE CONTINUED...

    Friday 21 October 2016

    4 Years Blog Anniversary


    This has been a very slow year blogging wise... too much work, travel, and family. 4 years ago I aimed for 2 posts a month. I nearly hit it in 2013, failed by a little in 2014, and went on to not meeting at all the last 2 years...
    Coincidently I also have a delightfully busy 2.5 year old running around, I suspect she has something to do with how little time I have. My eldest has developed a terrible addiction to reading so I hardly see her these days, can't be blame for this one.
    Since I started the blog I've also embarked on developing and maintaining JCTools, speaking at conferences, and starting a local JUG in Cape Town. Sharing of one kind breeds another, but you don't seem to gain bonus hours in the day. Hmmm.
    If I sound like I'm making excuses it's because I'm still feeling bad about missing that 2 posts a month goal.
    Despite my decreased output, blog traffic has steadily increased teaching me a thing or 2 about search and web economics. I hope to write more in the next year as I plan to give conferences a rest for the next 6-8 months, will see how it goes.
    In any case, thanks for reading :-)

    Monday 18 July 2016

    Fixing Coordinated Omission in Cassandra Stress

    Copyright © 2016 Apache Software Foundation
    I have it from reliable sources that incorrectly measuring latency can lead to losing ones job, loved ones, will to live and control of bowel movements. Out of my great love for fellow humans I have fixed the YCSB load generator to avoid the grave danger that is Coordinated Omission, this was met with great joy and I now gone on to fixup Cassandra Stress in similar vein. The fix is now merged into trunk, so expect to have it with the next Cassandra release (or just build it from source NOW).
    Before we start on the how and what, let us take a step back and consider the why:
    • Coordinated Omission is a term coined by my esteemed colleague Gil Tene to describe a common measurement anti pattern where the measuring system inadvertently coordinates with the system under measurement (watch Gil's How NOT to Measure Latency talk). This results in greatly skewed measurements in commonly used load generators, and has led to long discussions on Mechanical Sympathy mailing list etc. I've given my own explanation in a previous post, so go and have a read if you need a refresher on the subject.
    • Cassandra Stress (CS) is a tool which comes bundled with Cassandra to enable load testing of a given cluster. It allows the user to specify their own queries and schemas and is the prefered tool for Cassandra benchmarking as it gives better access to Cassandra features etc. CS allows 2 testing modes: throughput(default) or 'limit' where a given number of operations per second are thrown at the cluster (the fix discussed here does away with limit and replaces it with throttle/fixed, read on).
    So coordinated omission is bad, and it is now fixed in Cassandra Stress. This post talks a bit on motivation (part I), a bit on implementation (part II) and a bit on what you can do with the new features (part III). Feel free to skip any and all parts, god knows those selfies don't take themselves.

    PART I: Demonstrating The Issue

    CS default mode is to hit the system under test as hard as it can. This is a common strategy in load generators and can give system writers an interesting edge case to work with. I run the benchmark on my laptop (no attempt at finding out real performance numbers here, I just care about measurement issue) with the example provided workload I can saturate my Cassandra server (I only gave it a single core to run on) pretty easily. CS tells me the following about my miserable little server:
    INFO  15:02:46 Results:
    INFO  15:02:46 Op rate                   :    5,055 op/s  [insert: 500 op/s, simple1: 4,556 op/s]
    INFO  15:02:46 Partition rate            :   17,294 pk/s  [insert: 12,739 pk/s, simple1: 4,556 pk/s]
    INFO  15:02:46 Row rate                  :  150,266 row/s [insert: 12,739 row/s, simple1: 137,527 row/s]
    INFO  15:02:46 Latency mean              :    4.5 ms [insert: 7.5 ms, simple1: 4.2 ms]
    INFO  15:02:46 Latency median            :    3.0 ms [insert: 5.4 ms, simple1: 2.8 ms]
    INFO  15:02:46 Latency 95th percentile   :   13.1 ms [insert: 20.1 ms, simple1: 11.9 ms]
    INFO  15:02:46 Latency 99th percentile   :   23.8 ms [insert: 33.7 ms, simple1: 21.8 ms]
    INFO  15:02:46 Latency 99.9th percentile :   49.8 ms [insert: 55.0 ms, simple1: 49.0 ms]
    INFO  15:02:46 Latency max               :  603.5 ms [insert: 598.7 ms, simple1: 603.5 ms]
    INFO  15:02:46 Total partitions          :  1,000,000 [insert: 736,585, simple1: 263,415]
    INFO  15:02:46 Total errors              :          0 [insert: 0, simple1: 0]
    INFO  15:02:46 Total GC count            : 112
    INFO  15:02:46 Total GC memory           : 34.850 GiB
    INFO  15:02:46 Total GC time             :    1.0 seconds
    INFO  15:02:46 Avg GC time               :    9.4 ms
    INFO  15:02:46 StdDev GC time            :   16.6 ms
    INFO  15:02:46 Total operation time      : 00:00:57
    With the other 3 cores on my laptop hitting it as hard as they could the median latency on this maxed out server was 3ms. That is pretty awesome. But also, it makes no sense.
    How can a maxed out server have a typical response time of 3ms? In reality when servers are maxed out they are unresponsive, the typical response time becomes worse as the load increases. What CS is reporting however is not response time. It is 'latency'. Latency is one of those terms people use to mean many things and in this case in particular it does not mean "response time" but rather "service time". Here's a definition of more specific terms to describe system responsiveness(see wiki on response time):
    • Response time: The time between the submission of a request and the completion of the response.
    • Service time: The time between the initiation and completion of the response to a request.
    • Wait time: The time between the submission of the request and initiation of the response.
    In an all out test one could argue we want all the results as soon as possible, and given a magic load generator they would all launch instantaneously at the benchmark start. This will mean submission time for all requests is at the beginning of the run. Naturally the server will not be able to respond instantly to all requests and can only allow so many requests to be handled in parallel. If the max throughput is 5000 ops/sec, and we are measuring for 100,000 ops it would mean 95K ops have waited a second, so their response time would be at least 1 second (response time > wait time). By the end of the run we would have 5K ops which have patiently waited at least 19 seconds (so 99%ile should be at least 19 seconds).
    It follows that in an all out throughput benchmark response time is terrible by definition, and completely uninformative. It also follows that we should not expect the 'latency' above to be at all indicative of the sort of response time we would get from this server.
    The alternative to an all out throughput benchmark is a Responsiveness Under Load (RUL) benchmark. Using Cassandra Stress one can (or rather they could before this fix went in) use the '-rate limit=17000/s' option to benchmark under a load of 17k pks/sec.(pks = partition keys, each operation costs X keys, throughput limit is specified in pks not ops) Running this gives me a warm fuzzy feeling, now for sure I shall get a glimpse of the response time at max throughput:
    INFO  08:03:54 Results:
    INFO  08:03:54 Op rate                   :    3,712 op/s  [insert: 369 op/s, simple1: 3,343 op/s]
    INFO  08:03:54 Partition rate            :   12,795 pk/s  [insert: 9,452 pk/s, simple1: 3,343 pk/s]
    INFO  08:03:54 Row rate                  :  110,365 row/s [insert: 9,452 row/s, simple1: 100,913 row/s]
    INFO  08:03:54 Latency mean              :    1.0 ms [insert: 1.6 ms, simple1: 0.9 ms]
    INFO  08:03:54 Latency median            :    0.7 ms [insert: 1.3 ms, simple1: 0.7 ms]
    INFO  08:03:54 Latency 95th percentile   :    2.2 ms [insert: 3.4 ms, simple1: 2.0 ms]
    INFO  08:03:54 Latency 99th percentile   :    4.6 ms [insert: 7.4 ms, simple1: 4.1 ms]
    INFO  08:03:54 Latency 99.9th percentile :   13.4 ms [insert: 23.8 ms, simple1: 12.1 ms]
    INFO  08:03:54 Latency max               :   63.9 ms [insert: 59.9 ms, simple1: 63.9 ms]
    INFO  08:03:54 Total partitions          :    300,000 [insert: 221,621, simple1: 78,379]
    INFO  08:03:54 Total errors              :          0 [insert: 0, simple1: 0]
    INFO  08:03:54 Total GC count            : 33
    INFO  08:03:54 Total GC memory           : 10.270 GiB
    INFO  08:03:54 Total GC time             :    0.2 seconds
    INFO  08:03:54 Avg GC time               :    7.5 ms
    INFO  08:03:54 StdDev GC time            :    2.5 ms
    INFO  08:03:54 Total operation time      : 00:00:23
    This seems nice, and if I were not a suspicious man I might accept it. The thing is, I asked for 17k pks per second, but I only got  12,795 pk/s so obviously the server could not meet the implied schedule. If it could not meet the schedule, response time should be terrible. But it's not, it's very similar to the result we got above. Because? Because again, latency here means service time and not response time. While response time is not informative in an all out test, in an RUL benchmark it is the whole purpose of the benchmark. I have a schedule in mind, requests come at a particular rate, which implies they have a known start time (request n will start at: T0 + n/rate, T0 being the start of the test). This is coordinated omission, lets fix it.

    Part II(a): It Puts The Measurements In The HdrHistogram or It Gets The Hose Again!

    First off, I like to have HdrHistogram files to work with when looking at latency data (made a whole post about it too). They are usually tons better than whatever it is people hand roll, and they come with a bunch of tooling that makes data post processing easy. Importantly HdrHistograms are loss-less, configurable compact and support loss-less and compressed logging. Combine that with the high resolution of data and you have a great basis for post run analysis.
    Cassandra Stress had it own sampling latency collection mechanism which would randomly drop some samples as a means to improve memory footprint, so the replacement improved reporting accuracy and reduced the amount of code. A side effect of this change is that Cassandra perf regression testing stability has improved rather dramatically. As indicated by this graph:

    Because of the random sampling 99.9%ile reporting was unstable before the patch went in (May 28th), but smoooooth ever since. Ain't that nice?
    Here's the commit with the main changes on the branch that became the PR, there were further changes introduced to enable histogram logging to files. The change is mostly tedious, but importantly it introduces the notion of split response/service/wait histograms. The data is recorded as follows:
    The question is, what is the intended start time?

    Part II(b): Where Do I Start?

    Load generators, when not going all out, generally have some notion of schedule. It is more often than not quite simply a fixed rate of requests, though the notion of schedule holds regardless of how you come up with it. A schedule in this context means that for any event X there is a time when it should happen: st(X). That time is easily computed in a fixed rate schedule as: "st(n) = T0 + n/rate". Cassandra Stress however was using google's RateLimiter to provide it with the scheduling, and while battle tested and handy it does not have a notion of schedule. The replacement took place in 2 steps.
    First I refactored the existing code into hiding the details of how operations are scheduled and where they come from behind a blocking queue like interface. The next step was to support a fixed rate stream of operations where the intended schedule is available so I can use it. This is what I ended up with (further tweaked to only start the clock when all the consumer initialization has happened):

    Now we're all set to benchmark with no Coordinated Omission!

    Part II(c): Who Reports What?

    The measurement collection now all set up we face the question of what should different load generating modes report. Since it turned out that 'limit' was a confusing name (hard limit? upper limit?) it was decided to replace it with 2 distinct modes, adding to a total of 3 modes:
    • Throughput: latency == service time, response/wait time not recorded. Maximum throughput is an important test mode for flushing out CPU bottlenecks and contention. It may help in exposing IO configuration issues as well. It is not however a good predictor of response time distribution as no attempt is made to estimate independent request timing (i.e uncoordinated measurement). The maximum throughput number is valuable to batch processing etc, but I'd caution against using it for the sizing of a responsive system. If a system has responsiveness SLAs it is better to use the 'fixed' test mode and run for significant periods to determine the load under which response times SLA can be met.
    •  Throttle ("-rate throttle=1000/s"): latency == service time, response/wait time recorded like in fixed. This mode is a compromise for people who liked the old 'limit' measurement and want to measure "service time under attempted load". The partitions/second parameter is attempted for, but the summary does not reflect the response time. If you like, this is a way to sneakily record the response time while keeping the summary as it is so people are not so startled by the difference. I don't see myself using it, but have a blast.
    • Fixed  ("-rate fixed=1000/s"): latency == response time, service/response/wait time recorded. This mode is for benchmarking response time under load. The partitions/second parameter is used to determine the fixed rate scheduled of the requests. The hdr log can be used to visualize latencies over time or aggregate latencies for any segment down to the logging frequency. The logs contain response/wait/ service components that can be extracted and handled separately.
    In all of the above you can choose to record the HDR histograms to a file on a fixed interval. The interval is the one used for summary reporting. To enable histogram logging use: " -log interval=10s hdrfile=cs.hdr"
    Note that the interval should be quite long, and the default of 1s is not actually achievable by Cassandra Stress at the moment. This is down to the terrible way the reporting thread synchronizes with the load generation threads, and while it is one my wish list ("I wish I had time to fix it...") it was outside the scope of fixing CO, so lived to taunt us another day. I've settled on 10 second intervals for now, you can have even longer ones, all depends on the sort of granularity you want in your reports.

    Part III: What Is It Good For?

    So we got 3 load generating modes, and we've broken up latency into 3 components, the numbers are in histograms, the histograms are in the logs... let's have a little worked example.
    I run a Cassandra 2.1 node on my laptop (old software, bad hardware... I don't care. This is not about absolute numbers, it's about the measurement features). To run this please:
    1. Clone the Cassandra trunk: git clone https://github.com/apache/cassandra.git
    2. Build Cassandra and Cassandra Stress: ant clean build stress-build jar

    I use the example workload provided by the good Cassandra folks, and start with an all out stress test from cassandra/tools/bin:
    $ ./cassandra-stress user profile=../cqlstress-example.yaml duration=60s ops\(insert=1,simple1=9\) -mode native cql3 protocolVersion=2 -rate threads=3  -log interval=10s hdrfile=throughput.hdr
    Joy, my laptop is giving me awesome throughput of 13,803 pk/s. The summary itself is pretty informative for throughput runs what do we win with the HDR log?
    The log we produce can be summarized, manipulated and graphed by a bunch of utilities. Here's what the log looks like (this is from the fixed mode run):

    Note that while we logged our latency in nanoseconds, the max column is in milliseconds. The nanosecond level measurements are still available in the compressed histogram to the right. Sadly it's not very friendly on the eye. HdrHistogram does include a log processing utility, but it offers quite basic facilities. I've put together a few utilities for histogram log management in HdrLogProcessing. These allow you to split, union and summarize logs and work with the tags feature. Lets make them into handy aliases:

    • hdrsum=java -jar HdrLogProcessing-1.0-SNAPSHOT-jar-with-dependencies.jar SummarizeHistogramLogs
    • hdrspl=java -jar HdrLogProcessing-1.0-SNAPSHOT-jar-with-dependencies.jar SplitHistogramLogs
    • hdruni=java -jar HdrLogProcessing-1.0-SNAPSHOT-jar-with-dependencies.jar UnionHistogramLogs

    The throughput.hdr file we requested uses a recently added HdrHistogram feature which allows for the tagging of different histograms in one log. This makes it easy for applications logging histograms for different measurements to shove them all into a single file rather than many. Since we want to track 2 different operations with their separate response/service/wait times (so up to 6 files) this is quite handy. We can start by using the all in one HdrLogProcessing summary utility to add up the histogram data. The default mode of summary is percentiles and will produce a summary very similar to the one produced above by Cassandra Stress (using the parameters "-if throughput.hdr -ovr 1000000" to summarize in milliseconds):

    We can have the microsecond or nanosecond report by tweaking the outputValueUnitRatio(ovr). We are also free to ignore tags and create an all inclusive summary by specifying the ignoreTag(it) parameter. Used together to create a total summary in microseconds we get ("-if throughput.hdr -ovr 1000 -it):

    As a bonus, the summary tool allows us to process multiple files into the same result (logs from multiple benchmark runs for instance) and selecting periods within the logs to include.
    The summary tool also supports generating the HGRM format which allows us to produce the following graph (using "-if throughput.hdr -ovr 1000 -st hgrm -of tpt" and the google charts provided in HdrHistogram):



    Now imagine we used 3 different machines to stress a single Cassandra node. Because the logs are additive and loss less there's no issue with using the union tool to aggregate them all into a single log and process it as such. Similarly you can use the splitting tool to split out the operation you are interested in and manipulate it in isolation. Indeed the sky is the limit.
    Now, with a stunning 13K partitions per second, an a 50ms 99.99%ile I might think it a reasonable idea to say my server can safely sustain a 10K pk/s rate and call it a day. I shall test this adventurous little estimate using the throttle mode:
    $ ./cassandra-stress user profile=../cqlstress-example.yaml duration=60s ops\(insert=1,simple1=9\) -mode native cql3 protocolVersion=2 -rate throttle=10000/s threads=3 -log interval=10s hdrfile=throttle.hdr


    There according to this, we should have no trouble at all with 10K pk/s. I mean sure it's a bit close to the SLA, but surely no too bad? The throttle mode is in this case just a way to keep your head in the sand a bit longer, but it does record the response time in case you feel like maybe comparing them. Let's compare the service time histogram and the response time histogram for this run. To get both operations service time histograms I need to play with the log processing a bit:

    1. Split the throttle.hdr file:"hdrspl -if throttle.hdr" -> This results in 6 different files <op>-<rt/st/wt>.throttle.hdr.
    2. Summarize the service time histograms:"hdrsum -if .*.-st.throttle.hdr -ovr 1000 -it -st hgrm -of throttle-st" -> we get throttle-st.hgrm
    3. Summarize the service time histograms:"hdrsum -if .*.-rt.throttle.hdr -ovr 1000 -it -st hgrm -of throttle-rt" -> we get throttle-rt.hgrm
    4. Load them into the google charts provided in HdrHistogram.
    Take the red pill and you can keep your estimate, take the blue pill and you might keep your job.
    This is a fine demonstration of just how far from schedule operations can get without service time measurement registering an issue. In other words the red line, service time is the measurement with coordinated-omission, the blue is measurement which does not suffer from it.
    If you are finally convinced that you should be looking at response time instead of service time you can skip the throttle mode and move on straight to the fixed mode. The difference is only in what view of the world you want to see in your summary.
    Finally, you can take the histogram logs and graph the latencies over time using HdrHistogramVisualizer. It currently does not handle tags well, so you'll need to split your logs first, but after that you can generate graphs as funky as this one (this is a longer run with fixed 10K rate, plotting the maximum latency for inserts at the top):
    $ ./cassandra-stress user profile=../cqlstress-example.yaml duration=360s ops\(insert=1,simple1=9\) -mode native cql3 protocolVersion=2 -rate fixed=10000/s threads=3 -log interval=10s hdrfile=fixed.hdr
    $ hdrspl -if fixed.hdr


    This tells an interesting story, it seems as if the server was coping alright with the load to begin with, but hit a bump (GC? compaction?) after 225 seconds, and an even larger bump slightly later. That's lovely to know. I'm not much of a UI wiz and I'm sure some readers out there can make some valuable PRs to this project ;-)

    Summary: What's new?

    Given that with recent version of Cassandra Stress you can now test older versions of Cassandra as well (using the protocolVersion parameter as demonstrated above), you can stop using your crummy old version of Cassandra Stress, build trunk from source to get the following benefits:
    • HdrHistogram latency capturing and logging. Set the following options to get your very own histogram log: "-log interval=10s hdrfile=myawesomelog.hdr".
    • Have a look at the HdrLogProcessing project for some handy utilities to help you slice and dice the data. They are pretty simple and you can feel free to build/contribute your own.
    • 2 new load generation modes: throttle and fixed now replace the old limit. Use fixed to get a view on your cluster's response time under attempted constant load.
    With HdrHistogram support now available in many languages, you can easily build post processing utilities for analysis in a language of your choice.
    Have fun storming the castle!

    Many many thanks to Jake Luciany of the Apache Cassandra project for reviewing the PR, helping to shape it, and merging it it. Jake is an Open Source hero, buy that man a drink on sight!
    Both Jake and Chris Batey reviewed this post, if any errors remain it is down to their disgustingly sloppy and unprofessional ways, please let me know and I shall have words with their parents.