Psychosomatic, Lobotomy, Saw: July 2016

I have it from reliable sources that incorrectly measuring latency can lead to losing ones job, loved ones, will to live and control of bowel movements. Out of my great love for fellow humans I have fixed the YCSB load generator to avoid the grave danger that is Coordinated Omission, this was met with great joy and I now gone on to fixup Cassandra Stress in similar vein. The fix is now merged into trunk, so expect to have it with the next Cassandra release (or just build it from source NOW).
Before we start on the how and what, let us take a step back and consider the why:

Coordinated Omission is a term coined by my esteemed colleague Gil Tene to describe a common measurement anti pattern where the measuring system inadvertently coordinates with the system under measurement (watch Gil's How NOT to Measure Latency talk). This results in greatly skewed measurements in commonly used load generators, and has led to long discussions on Mechanical Sympathy mailing list etc. I've given my own explanation in a previous post, so go and have a read if you need a refresher on the subject.
Cassandra Stress (CS) is a tool which comes bundled with Cassandra to enable load testing of a given cluster. It allows the user to specify their own queries and schemas and is the prefered tool for Cassandra benchmarking as it gives better access to Cassandra features etc. CS allows 2 testing modes: throughput(default) or 'limit' where a given number of operations per second are thrown at the cluster (the fix discussed here does away with limit and replaces it with throttle/fixed, read on).

So coordinated omission is bad, and it is now fixed in Cassandra Stress. This post talks a bit on motivation (part I), a bit on implementation (part II) and a bit on what you can do with the new features (part III). Feel free to skip any and all parts, god knows those selfies don't take themselves.

PART I: Demonstrating The Issue

CS default mode is to hit the system under test as hard as it can. This is a common strategy in load generators and can give system writers an interesting edge case to work with. I run the benchmark on my laptop (no attempt at finding out real performance numbers here, I just care about measurement issue) with the example provided workload I can saturate my Cassandra server (I only gave it a single core to run on) pretty easily. CS tells me the following about my miserable little server:

INFO 15:02:46 Results:

INFO 15:02:46 Op rate : 5,055 op/s [insert: 500 op/s, simple1: 4,556 op/s]

INFO 15:02:46 Partition rate : 17,294 pk/s [insert: 12,739 pk/s, simple1: 4,556 pk/s]

INFO 15:02:46 Row rate : 150,266 row/s [insert: 12,739 row/s, simple1: 137,527 row/s]

INFO 15:02:46 Latency mean : 4.5 ms [insert: 7.5 ms, simple1: 4.2 ms]

INFO 15:02:46 Latency median : 3.0 ms [insert: 5.4 ms, simple1: 2.8 ms]

INFO 15:02:46 Latency 95th percentile : 13.1 ms [insert: 20.1 ms, simple1: 11.9 ms]

INFO 15:02:46 Latency 99th percentile : 23.8 ms [insert: 33.7 ms, simple1: 21.8 ms]

INFO 15:02:46 Latency 99.9th percentile : 49.8 ms [insert: 55.0 ms, simple1: 49.0 ms]

INFO 15:02:46 Latency max : 603.5 ms [insert: 598.7 ms, simple1: 603.5 ms]

INFO 15:02:46 Total partitions : 1,000,000 [insert: 736,585, simple1: 263,415]

INFO 15:02:46 Total errors : 0 [insert: 0, simple1: 0]

INFO 15:02:46 Total GC count : 112

INFO 15:02:46 Total GC memory : 34.850 GiB

INFO 15:02:46 Total GC time : 1.0 seconds

INFO 15:02:46 Avg GC time : 9.4 ms

INFO 15:02:46 StdDev GC time : 16.6 ms

INFO 15:02:46 Total operation time : 00:00:57

With the other 3 cores on my laptop hitting it as hard as they could the median latency on this maxed out server was 3ms. That is pretty awesome. But also, it makes no sense.

How can a maxed out server have a typical response time of 3ms? In reality when servers are maxed out they are unresponsive, the typical response time becomes worse as the load increases. What CS is reporting however is not response time. It is 'latency'. Latency is one of those terms people use to mean many things and in this case in particular it does not mean "response time" but rather "service time". Here's a definition of more specific terms to describe system responsiveness(see wiki on response time):

Response time: The time between the submission of a request and the completion of the response.
Service time: The time between the initiation and completion of the response to a request.
Wait time: The time between the submission of the request and initiation of the response.

In an all out test one could argue we want all the results as soon as possible, and given a magic load generator they would all launch instantaneously at the benchmark start. This will mean submission time for all requests is at the beginning of the run. Naturally the server will not be able to respond instantly to all requests and can only allow so many requests to be handled in parallel. If the max throughput is 5000 ops/sec, and we are measuring for 100,000 ops it would mean 95K ops have waited a second, so their response time would be at least 1 second (response time > wait time). By the end of the run we would have 5K ops which have patiently waited at least 19 seconds (so 99%ile should be at least 19 seconds).

It follows that in an all out throughput benchmark response time is terrible by definition, and completely uninformative. It also follows that we should not expect the 'latency' above to be at all indicative of the sort of response time we would get from this server.

The alternative to an all out throughput benchmark is a Responsiveness Under Load (RUL) benchmark. Using Cassandra Stress one can (or rather they could before this fix went in) use the '-rate limit=17000/s' option to benchmark under a load of 17k pks/sec.(pks = partition keys, each operation costs X keys, throughput limit is specified in pks not ops) Running this gives me a warm fuzzy feeling, now for sure I shall get a glimpse of the response time at max throughput:

INFO 08:03:54 Results:

INFO 08:03:54 Op rate : 3,712 op/s [insert: 369 op/s, simple1: 3,343 op/s]

INFO 08:03:54 Partition rate : 12,795 pk/s [insert: 9,452 pk/s, simple1: 3,343 pk/s]

INFO 08:03:54 Row rate : 110,365 row/s [insert: 9,452 row/s, simple1: 100,913 row/s]

INFO 08:03:54 Latency mean : 1.0 ms [insert: 1.6 ms, simple1: 0.9 ms]

INFO 08:03:54 Latency median : 0.7 ms [insert: 1.3 ms, simple1: 0.7 ms]

INFO 08:03:54 Latency 95th percentile : 2.2 ms [insert: 3.4 ms, simple1: 2.0 ms]

INFO 08:03:54 Latency 99th percentile : 4.6 ms [insert: 7.4 ms, simple1: 4.1 ms]

INFO 08:03:54 Latency 99.9th percentile : 13.4 ms [insert: 23.8 ms, simple1: 12.1 ms]

INFO 08:03:54 Latency max : 63.9 ms [insert: 59.9 ms, simple1: 63.9 ms]

INFO 08:03:54 Total partitions : 300,000 [insert: 221,621, simple1: 78,379]

INFO 08:03:54 Total errors : 0 [insert: 0, simple1: 0]

INFO 08:03:54 Total GC count : 33

INFO 08:03:54 Total GC memory : 10.270 GiB

INFO 08:03:54 Total GC time : 0.2 seconds

INFO 08:03:54 Avg GC time : 7.5 ms

INFO 08:03:54 StdDev GC time : 2.5 ms

INFO 08:03:54 Total operation time : 00:00:23

This seems nice, and if I were not a suspicious man I might accept it. The thing is, I asked for 17k pks per second, but I only got 12,795 pk/s so obviously the server could not meet the implied schedule. If it could not meet the schedule, response time should be terrible. But it's not, it's very similar to the result we got above. Because? Because again, latency here means service time and not response time. While response time is not informative in an all out test, in an RUL benchmark it is the whole purpose of the benchmark. I have a schedule in mind, requests come at a particular rate, which implies they have a known start time (request n will start at: T0 + n/rate, T0 being the start of the test). This is coordinated omission, lets fix it.

Part II(a): It Puts The Measurements In The HdrHistogram or It Gets The Hose Again!

First off, I like to have HdrHistogram files to work with when looking at latency data (made a whole post about it too). They are usually tons better than whatever it is people hand roll, and they come with a bunch of tooling that makes data post processing easy. Importantly HdrHistograms are loss-less, configurable compact and support loss-less and compressed logging. Combine that with the high resolution of data and you have a great basis for post run analysis.
Cassandra Stress had it own sampling latency collection mechanism which would randomly drop some samples as a means to improve memory footprint, so the replacement improved reporting accuracy and reduced the amount of code. A side effect of this change is that Cassandra perf regression testing stability has improved rather dramatically. As indicated by this graph:

Because of the random sampling 99.9%ile reporting was unstable before the patch went in (May 28th), but smoooooth ever since. Ain't that nice?

Here's the commit with the main changes on the branch that became the PR, there were further changes introduced to enable histogram logging to files. The change is mostly tedious, but importantly it introduces the notion of split response/service/wait histograms. The data is recorded as follows:

The question is, what is the intended start time?

Part II(b): Where Do I Start?

Load generators, when not going all out, generally have some notion of schedule. It is more often than not quite simply a fixed rate of requests, though the notion of schedule holds regardless of how you come up with it. A schedule in this context means that for any event X there is a time when it should happen: st(X). That time is easily computed in a fixed rate schedule as: "st(n) = T0 + n/rate". Cassandra Stress however was using google's RateLimiter to provide it with the scheduling, and while battle tested and handy it does not have a notion of schedule. The replacement took place in 2 steps.

First I refactored the existing code into hiding the details of how operations are scheduled and where they come from behind a blocking queue like interface. The next step was to support a fixed rate stream of operations where the intended schedule is available so I can use it. This is what I ended up with (further tweaked to only start the clock when all the consumer initialization has happened):

Now we're all set to benchmark with no Coordinated Omission!

Part II(c): Who Reports What?

The measurement collection now all set up we face the question of what should different load generating modes report. Since it turned out that 'limit' was a confusing name (hard limit? upper limit?) it was decided to replace it with 2 distinct modes, adding to a total of 3 modes:

Throughput: latency == service time, response/wait time not recorded. Maximum throughput is an important test mode for flushing out CPU bottlenecks and contention. It may help in exposing IO configuration issues as well. It is not however a good predictor of response time distribution as no attempt is made to estimate independent request timing (i.e uncoordinated measurement). The maximum throughput number is valuable to batch processing etc, but I'd caution against using it for the sizing of a responsive system. If a system has responsiveness SLAs it is better to use the 'fixed' test mode and run for significant periods to determine the load under which response times SLA can be met.
Throttle ("-rate throttle=1000/s"): latency == service time, response/wait time recorded like in fixed. This mode is a compromise for people who liked the old 'limit' measurement and want to measure "service time under attempted load". The partitions/second parameter is attempted for, but the summary does not reflect the response time. If you like, this is a way to sneakily record the response time while keeping the summary as it is so people are not so startled by the difference. I don't see myself using it, but have a blast.
Fixed ("-rate fixed=1000/s"): latency == response time, service/response/wait time recorded. This mode is for benchmarking response time under load. The partitions/second parameter is used to determine the fixed rate scheduled of the requests. The hdr log can be used to visualize latencies over time or aggregate latencies for any segment down to the logging frequency. The logs contain response/wait/ service components that can be extracted and handled separately.

In all of the above you can choose to record the HDR histograms to a file on a fixed interval. The interval is the one used for summary reporting. To enable histogram logging use: " -log interval=10s hdrfile=cs.hdr"
Note that the interval should be quite long, and the default of 1s is not actually achievable by Cassandra Stress at the moment. This is down to the terrible way the reporting thread synchronizes with the load generation threads, and while it is one my wish list ("I wish I had time to fix it...") it was outside the scope of fixing CO, so lived to taunt us another day. I've settled on 10 second intervals for now, you can have even longer ones, all depends on the sort of granularity you want in your reports.

Part III: What Is It Good For?

So we got 3 load generating modes, and we've broken up latency into 3 components, the numbers are in histograms, the histograms are in the logs... let's have a little worked example.

I run a Cassandra 2.1 node on my laptop (old software, bad hardware... I don't care. This is not about absolute numbers, it's about the measurement features). To run this please:

Clone the Cassandra trunk: git clone https://github.com/apache/cassandra.git
Build Cassandra and Cassandra Stress: ant clean build stress-build jar

I use the example workload provided by the good Cassandra folks, and start with an all out stress test from cassandra/tools/bin:

$ ./cassandra-stress user profile=../cqlstress-example.yaml duration=60s ops$insert=1,simple1=9$ -mode native cql3 protocolVersion=2 -rate threads=3 -log interval=10s hdrfile=throughput.hdr

Joy, my laptop is giving me awesome throughput of 13,803 pk/s. The summary itself is pretty informative for throughput runs what do we win with the HDR log?
The log we produce can be summarized, manipulated and graphed by a bunch of utilities. Here's what the log looks like (this is from the fixed mode run):

Note that while we logged our latency in nanoseconds, the max column is in milliseconds. The nanosecond level measurements are still available in the compressed histogram to the right. Sadly it's not very friendly on the eye. HdrHistogram does include a log processing utility, but it offers quite basic facilities. I've put together a few utilities for histogram log management in HdrLogProcessing. These allow you to split, union and summarize logs and work with the tags feature. Lets make them into handy aliases:

hdrsum=java -jar HdrLogProcessing-1.0-SNAPSHOT-jar-with-dependencies.jar SummarizeHistogramLogs
hdrspl=java -jar HdrLogProcessing-1.0-SNAPSHOT-jar-with-dependencies.jar SplitHistogramLogs
hdruni=java -jar HdrLogProcessing-1.0-SNAPSHOT-jar-with-dependencies.jar UnionHistogramLogs

The throughput.hdr file we requested uses a recently added HdrHistogram feature which allows for the tagging of different histograms in one log. This makes it easy for applications logging histograms for different measurements to shove them all into a single file rather than many. Since we want to track 2 different operations with their separate response/service/wait times (so up to 6 files) this is quite handy. We can start by using the all in one HdrLogProcessing summary utility to add up the histogram data. The default mode of summary is percentiles and will produce a summary very similar to the one produced above by Cassandra Stress (using the parameters "-if throughput.hdr -ovr 1000000" to summarize in milliseconds):

We can have the microsecond or nanosecond report by tweaking the outputValueUnitRatio(ovr). We are also free to ignore tags and create an all inclusive summary by specifying the ignoreTag(it) parameter. Used together to create a total summary in microseconds we get ("-if throughput.hdr -ovr 1000 -it" ):

As a bonus, the summary tool allows us to process multiple files into the same result (logs from multiple benchmark runs for instance) and selecting periods within the logs to include.
The summary tool also supports generating the HGRM format which allows us to produce the following graph (using "-if throughput.hdr -ovr 1000 -st hgrm -of tpt" and the google charts provided in HdrHistogram):

Now imagine we used 3 different machines to stress a single Cassandra node. Because the logs are additive and loss less there's no issue with using the union tool to aggregate them all into a single log and process it as such. Similarly you can use the splitting tool to split out the operation you are interested in and manipulate it in isolation. Indeed the sky is the limit.
Now, with a stunning 13K partitions per second, an a 50ms 99.99%ile I might think it a reasonable idea to say my server can safely sustain a 10K pk/s rate and call it a day. I shall test this adventurous little estimate using the throttle mode:
$ ./cassandra-stress user profile=../cqlstress-example.yaml duration=60s ops$insert=1,simple1=9$ -mode native cql3 protocolVersion=2 -rate throttle=10000/s threads=3 -log interval=10s hdrfile=throttle.hdr

There according to this, we should have no trouble at all with 10K pk/s. I mean sure it's a bit close to the SLA, but surely no too bad? The throttle mode is in this case just a way to keep your head in the sand a bit longer, but it does record the response time in case you feel like maybe comparing them. Let's compare the service time histogram and the response time histogram for this run. To get both operations service time histograms I need to play with the log processing a bit:

Split the throttle.hdr file:"hdrspl -if throttle.hdr" -> This results in 6 different files <op>-<rt/st/wt>.throttle.hdr.
Summarize the service time histograms:"hdrsum -if .*.-st.throttle.hdr -ovr 1000 -it -st hgrm -of throttle-st" -> we get throttle-st.hgrm
Summarize the service time histograms:"hdrsum -if .*.-rt.throttle.hdr -ovr 1000 -it -st hgrm -of throttle-rt" -> we get throttle-rt.hgrm
Load them into the google charts provided in HdrHistogram.

Take the red pill and you can keep your estimate, take the blue pill and you might keep your job.
This is a fine demonstration of just how far from schedule operations can get without service time measurement registering an issue. In other words the red line, service time is the measurement with coordinated-omission, the blue is measurement which does not suffer from it.
If you are finally convinced that you should be looking at response time instead of service time you can skip the throttle mode and move on straight to the fixed mode. The difference is only in what view of the world you want to see in your summary.
Finally, you can take the histogram logs and graph the latencies over time using HdrHistogramVisualizer. It currently does not handle tags well, so you'll need to split your logs first, but after that you can generate graphs as funky as this one (this is a longer run with fixed 10K rate, plotting the maximum latency for inserts at the top):
$ ./cassandra-stress user profile=../cqlstress-example.yaml duration=360s ops$insert=1,simple1=9$ -mode native cql3 protocolVersion=2 -rate fixed=10000/s threads=3 -log interval=10s hdrfile=fixed.hdr
$ hdrspl -if fixed.hdr

This tells an interesting story, it seems as if the server was coping alright with the load to begin with, but hit a bump (GC? compaction?) after 225 seconds, and an even larger bump slightly later. That's lovely to know. I'm not much of a UI wiz and I'm sure some readers out there can make some valuable PRs to this project ;-)

Summary: What's new?

Given that with recent version of Cassandra Stress you can now test older versions of Cassandra as well (using the protocolVersion parameter as demonstrated above), you can stop using your crummy old version of Cassandra Stress, build trunk from source to get the following benefits:

HdrHistogram latency capturing and logging. Set the following options to get your very own histogram log: "-log interval=10s hdrfile=myawesomelog.hdr".
Have a look at the HdrLogProcessing project for some handy utilities to help you slice and dice the data. They are pretty simple and you can feel free to build/contribute your own.
2 new load generation modes: throttle and fixed now replace the old limit. Use fixed to get a view on your cluster's response time under attempted constant load.

With HdrHistogram support now available in many languages, you can easily build post processing utilities for analysis in a language of your choice.
Have fun storming the castle!

Many many thanks to Jake Luciany of the Apache Cassandra project for reviewing the PR, helping to shape it, and merging it it. Jake is an Open Source hero, buy that man a drink on sight!
Both Jake and Chris Batey reviewed this post, if any errors remain it is down to their disgustingly sloppy and unprofessional ways, please let me know and I shall have words with their parents.

Psychosomatic, Lobotomy, Saw

Monday 18 July 2016

Fixing Coordinated Omission in Cassandra Stress