Psychosomatic, Lobotomy, Saw: March 2015

YCSB is the Yahoo Cloud Serving Benchmark(also on wiki): a generic set of benchmarks setting out

The Nimbus Cloud Serving Board

to compare different key-value store providers under a set of loads:

The goal of the Yahoo Cloud Serving Benchmark (YCSB) project is to develop a framework and common set of workloads for evaluating the performance of different "key-value" and "cloud" serving stores.

The code is open for extension and contribution and all that good stuff, you can get it here. And it has become tool for comparing vendors in the NoSQL space. The benchmarks set out to measure latency and throughput. The terms are not directly defined in the paper, but the following statement is made:

The Performance tier of the benchmark focuses on the latency of requests when the database is under load. Latency is very important in serving systems, since there is usually an impatient human waiting for a web page to load. [...] Typically application designers must decide on an acceptable latency, and provision enough servers to achieve the desired throughput while preserving acceptable latency. [...] The YCSB Client allows the user to define the offered throughput as a command line parameter, and reports the resulting latency, making it straightforward to produce latency versus throughput curves.

What could possibly go wrong?™

It can go this wrong for instance, order of magnitude difference in results for different percentiles, leading to some poor decision making on how much hardware you'll need, leading to getting fired from your job and growing old bitter and twisted mumbling to yourself as you get drunk on the street corner until you freeze to death on a winter night. So potentially this is a risk to your future well being, listen up!

It's broken? Coordinated WOT?

When you measure latency bad,
Mr. Tene is sad
:(

My colleague at Azul, Gil Tene, the magnificent, glorious, multi-dimensional, coding CTO, officer and gentleman (that's my bonus sorted) has been doing a lot of latency related preaching and teaching in the last few years. He has given the following talks at any number of conferences, but if you happened to have missed them, watch them NOW:

"How NOT to Measure Latency" - At QCon London, 2013
"Understanding Latency" - At React Conf SF, 2014

In particular he has coined the term "Coordinated Omission" (see raging discussion on Mechanical Sympathy) to describe instances in which the measuring/monitoring system coordinates measurement with the system under test/measurement such that samples are biased. This issue manifests in many load generating frameworks where the call into the system under test is done synchronously and the measurement thread holds up the next sample while the call is ongoing. This enables the system under test to delay requests that would have been made during the synchronous call thus skewing the sample set. Consider for example a system where:

We set out to measure a request every 1ms (from a single thread, synchronously)
The first 500 calls come back in 100µs each (so call K starts at Kms and returns at Kms + 100µs )
Call 501 takes 500 milliseconds (starts at 500ms, returns at 1 second)
Call 502 takes 100µs

See the problem?

The problem is that call 502 did NOT happen at it's designated time, and saying it took 100µs fails to capture this. It failed the assumptions laid out in the first sentence because we were blocked for 500ms. If we were to stick to our original schedule we would be making calls 502 to 1000 in the time it took for call 501 to execute. How should we treat this departure from plan?

Is it safe to come out yet?

Ignore it! it will go away by itself! - This is the coordinated omission way. We are now reporting numbers that are no longer according to the test plan, which means that our "latency versus throughput curves" are off the mark. This is a very common solution to this issue.
Fail the test, we wanted a call every ms and we didn't get that - This is an honest hardline answer, but it potentially throws the baby with the bath water. I think that if you set out to schedule 1000 calls per second you might want to see how often this falls apart and how. But this answer is The Truth™, can you handle it? If one is to start from scratch and write their own load generator I propose a read of the Iago load test framework philosophy page: "Iago accurately replicates production traffic. It models open systems, systems which receive requests independent of their ability to service them. Typical load generators measure the time it takes for M threads to make N requests, waiting for a response to each request before sending the next; if your system slows down under load, these load testers thus mercifully slow down their pace to match. That's a fine thing to measure; many systems behave this way. But maybe your service isn't such a system; maybe it's exposed on the internet. Maybe you want to know how your system behaves when N requests per second come in with no "mercy" if it slows down.". This is a fine sentiment.
Coordinated Omission correction: Adjust the results to reflect the expected call rate. This can be done in a straight forward manner if the 'missing' calls are added back with a latency which reflects the period for which they were delayed. This correction method is supported out of the box by HdrHistogram but the discussion regarding it's over or under estimation of the impact of the delay is outside the scope of this post.
Coordinated Omission avoidance: Measure all calls according to original schedule. We are now saying: "If I can't make the call, the meter is still running!". This is particularly relevant for systems where you would typically be making the requests to the system under test from a thread pool. That thread pool would be there to help you support asynchronous interaction where the API failed in giving you that option. Like JDBC... Like many key-value pair provider APIs.

This last solution is the one we'll go for in this post, but I would urge you to consider the results critically. In particular if you are trying to simulate independent access to a web server (as opposed to a DB via a thread pool) then the adherence to schedule might be hugely optimistic of the results in your case. This is because failing to generate independent load may have all sorts of beneficial effects on the system under test.
For the YCSB benchmark I'm assuming the harness/load generator is simulating a web serving layer accessing the key-value store in an effort to serve an unbounded, uncoordinated user request load via a predefined thread pool. So it's door number 3 for me. The corrected load generator is here.

Step 0: Some preliminary work (not strictly required)

HdrHistogram, as approved by Freddie!

As described previously here, we should all just get on with capturing latency using HdrHistograms. So as a first step toward correcting YCSB I have gone in and added an HdrHistogram measurement container. This is pretty straight forward as all I needed to modify was the Measurements class to allow a new measurement type. While I was there I tweaked this and that and the following list of changes to that class emerged:

Add new measurement type and corresponding command-line option("-p measurementtype=hdrhistogram")
Add combined measurement option allowing old/new measurement side by side: "hdrhistogram+histogram"
Add support for capturing both corrected and uncorrected measurements for the same run.
Use CHM instead of synchronizing around a HashMap.

The new measurement type supports logging loss less HdrHistogram data to a file (controlled by the hdrhistogram.fileoutput=<true|false> option and the hdrhistogram.output.path=<path> option) as well as better precision percentile data and lock free logging of latencies. This is not very interesting work but if you are interested in the "How would I plug in my own data structure to capture latency into YCSB?" topic have fun. It was not necessary for correction but it was good to do so that better quality results can be observed. You're welcome.

Step 1: Demonstrate the issue

YCSB includes a very useful means of verifying the measurements in the form of a mock DB driver. This means we can test our assertions regarding coordinated omission without setting up a key value store of any kind. The mock DB is called BasicDB and is the default DB used. We can configure it to simulate a pause and see what happens (-p basicdb.verbose=false -p basicdb.simulatedelay=4 will make the mock DB stop logging every action and simulate a latency of 0-4ms for each action). I added a further option to the BasicDB which allows us to turn off the randomization of the delay (-p basicdb.randomizedelay=false).

Let's consider our expectations in the case where a DB simply cannot handle request quickly enough. We can setup an experiment with the following settings: -target 1000 -threads 1 -s -p status.interval=1 -p workload=com.yahoo.ycsb.workloads.CoreWorkload -p basicdb.verbose=false -p basicdb.simulatedelay=4 -p basicdb.randomizedelay=false -p measurementtype=hdrhistogram -p maxexecutiontime=60

Here's what they all mean:

-target 1000 -> We aim to test 1000 requests per second
-threads 1 -> We have a single client thread
-s -p status.interval=1 -> We will be printing out status every second (I made the status interval configurable)
-p basicdb.verbose=false -p basicdb.simulatedelay=4 -p basicdb.randomizedelay=false -> The DB will sleep 4ms on each request, so the maximum we can hope for is 250, no noisy printing per operation please
-p measurementtype=hdrhistogram -> Use HdrHistogram to capture the latencies
-p maxexecutiontime=60 -> Run for one minute, then exit and print summary

This DB is obviously failing, it can't keep up with the rate of incoming requests and according to our model they queue up. The time measured per call is reflected in the summary for the READ operations:

[READ], Operations, 12528.0

[READ], AverageLatency(us), 4477.102809706258

[READ], MinLatency(us), 4018.0

[READ], MaxLatency(us), 44703.0

[READ], 95thPercentileLatency(ms), 4.0

[READ], 99thPercentileLatency(ms), 4.0

[READ], Return=0, 12528

But this completely ignores the time spent on the queue. If we were measuring according to schedule we'd get the following set of latencies:

Latency[k] = 4 + 3*(k-1) ms

The max latency would be for the last request to get in. We ran for 60 seconds, at 250 requests/sec which means our last request was (k=15000) and had a latency of 45 seconds when measured from the time we intended to make it. This number reflects the system's failure to handle load far more correctly than the numbers quoted above.

Step 2: Working to Schedule

The YCSB load generator has a weak notion of schedule, in the sense that it opts for option number 1 above and will just execute the operations when it can. When faced with the task of correcting this kind of issue (in a pile of foreign code) we need to look for 2 things in the load generator:

"Scheduling an action to run at time X" - This will involve some calls to one of the many scheduling facilities in the JDK:

Thread.sleep is an old favourite, but TimeUnit also supports a sleep method. A search for sleep in the code base will cover both. This is what YCSB was using to schedule next event to fire.
Code submitting tasks to java.util.Timer, or alternatively the ScheduledExecutorService
Code using LockSupport.parkNanos
Object.wait(...)
others?

"Measuring the length of an operation" - This will involve calls to System.nanoTime() or currentTimeMillis(). For YCSB this is found to happen for example here.

To correct this problem I had to introduce the concept of 'intended start time' to the operations measurement. Schedule for YCSB is specified by the -target command line option which sets the overall number of operations per second to be attempted by the load generator. This is optional, and the default is to go as fast as you can manage, i.e. with no schedule but the back pressure from the system under test to guide us. I'm not sure what a good assumed rate of requests is reasonable in this case, so I did not correct this case. NOTE: If you don't specify target no correction will take place.

The target parameter is translated to a per-thread operation rate (number of threads is set via the threads option, default is 1) so if we have 10 threads, and the target request rate is 1000 (-target 1000 -threads 10) we will have each thread hitting the store with 100 requests per second. The client threads randomize the first operation time to avoid all hitting the store on the same interval. I did some ground work here by setting the units across the board to nanoseconds and naming interval parameters appropriately, nothing too exciting.
The actual correction at it's core involves:

Record the operation's intended start time
Use the intended start time when computing latency

Sadly the way YCSB measures latency does not lend itself to a simple in place fix. The operations are scheduled by the ClientThread which calls into a workload, calling into a DB, which is actually the DBWrapper which measures the latency (for calling into an actual DB implementation) and reports it to the central Measurements singleton. This means that changing the Workload/DB API to include a startTime parameter to each call is quite a far reaching change which would require me to dig through all the DB drivers implementations and would result in a very unpleasant time for all.
I settled on using a thread local on the Measurements object to transfer the start time to the DBWrapper, it is not a nice way to do things (and I'm happy to hear better suggestions) but it does the job without modifying the API.
Once we have:

ClientThread setting up the start time for the operation via Measurements
DBWrapper using the start time from Measurements to compute the operation latency

That's pretty much it. For extra points I wanted to include some facilities to compare measurements before/after the change. These can be removed if we accept HdrHistogram as a replacement and if we accept we only want to measure the intended latency, which would result in a much smaller PR.

Step 3: is the issue solved?

Running the setup from step 1 such that it produces the intended latency as well as the original measurement side by side(-p measurement.interval=both) yields the following result for the READ operations:

[READ], Operations, 12414.0

[READ], AverageLatency(us), 4524.981069759949

[READ], MinLatency(us), 4018.0

[READ], MaxLatency(us), 24703.0

[READ], 95thPercentileLatency(ms), 4.0

[READ], 99thPercentileLatency(ms), 4.0

[READ], Return=0, 12414

[Intended-READ], Operations, 12414.0

[Intended-READ], AverageLatency(us), 2.359010991606251E7

[Intended-READ], MinLatency(us), 4256.0

[Intended-READ], MaxLatency(us), 4.6989311E7

[Intended-READ], 95thPercentileLatency(ms), 42369.0

[Intended-READ], 99thPercentileLatency(ms), 46530.0

This reflects the effect a backed up system would have on latency as we express in Step 1 above. It's actually a bit worse because the average cost of calling the mock DB with a sleep of 4ms is 4.5ms. As we can see the maximum latency is 46.9 seconds, reflecting the fact that the last read to execute was scheduled to hit the system 13.1 seconds into the run.

Step 4: The limitations of the harness

We can now also consider the perfect DB for the sake of observing the short comings of the test harness by setting the mock DB delay to 0(-p basicdb.simulatedelay=0):

[READ], Operations, 56935.0

[READ], AverageLatency(us), 0.01796785808377975

[READ], MinLatency(us), 0.0

[READ], MaxLatency(us), 49.0

[READ], 95thPercentileLatency(ms), 0.0

[READ], 99thPercentileLatency(ms), 0.0

[READ], Return=0, 56935

[Intended-READ], Operations, 56935.0

[Intended-READ], AverageLatency(us), 232.37026433652412

[Intended-READ], MinLatency(us), 0.0

[Intended-READ], MaxLatency(us), 39007.0

[Intended-READ], 95thPercentileLatency(ms), 0.0

[Intended-READ], 99thPercentileLatency(ms), 0.0

How come it take so long to measure a noop? why such large differences? Here's some generic theories and how they panned out:

The JVM running the load generator is running with suboptimal settings(-Xms64m -Xmx64m -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime, Oracle JDK8u31) on a busy Mac laptop running on battery

This is no way to benchmark anything, but the interesting thing is that if we have no schedule to stick to the test harness is willing to just ignore the issue. If we run on a decent machine (with a decent OS) we get nicer results. This is from a server class machine running CentOS6.3/OracleJDK8u25 with same settings:

[READ], Operations, 56930.0

[READ], AverageLatency(us), 0.44417705954681186

[READ], MinLatency(us), 0.0

[READ], MaxLatency(us), 20.0

[READ], 95thPercentileLatency(ms), 0.0

[READ], 99thPercentileLatency(ms), 0.0

[READ], Return=0, 56930

[Intended-READ], Operations, 56930.0

[Intended-READ], AverageLatency(us), 146.31262954505533

[Intended-READ], MinLatency(us), 15.0

[Intended-READ], MaxLatency(us), 14255.0

[Intended-READ], 95thPercentileLatency(ms), 0.0

[Intended-READ], 99thPercentileLatency(ms), 0.0

This is still significant.

The JVM suffers from warmup related artefacts

This certainly correlated to the max values I'm seeing here. When looking at the status line for the first second I see:

[READ: Count=22, Max=14, Min=0, Avg=0.64, 90=0, 99=14, 99.9=14, 99.99=14]

[Intended-READ: Count=23, Max=14255, Min=15, Avg=5563.39, 90=13287, 99=14255, 99.9=14255, 99.99=14255]
But after a few seconds the process settles and we see much better results, this is typical:
[READ: Count=947, Max=14, Min=0, Avg=0.02, 90=0, 99=0, 99.9=2, 99.99=14]
[Intended-READ: Count=946, Max=194, Min=61, Avg=151.29, 90=165, 99=175, 99.9=186, 99.99=194]
A good way to handle this issue is by relying on the HdrHistogram output files to grab relevant time periods for analysis. With the original YCSB output we have the percentile summary data, but this is not something we can combine for analysis. With the loss-less interval histogram logs we can look at any sub-period(which is longer than one interval, but shorter than the whole run) and get accurate full range histogram data. A common practice is to discard warmup period results, I'm no a fan of throwing away data, but since this is the load generator warmup I'd think it's quite legitimate. It's perhaps an interesting feature to add to such a framework that the framework can be warmed up separately from the system to examine cold system behaviour.

Thread.sleep/LockSupport.parkNanos are not super accurate and may wakeup after the intended operation start time

I've added an option for spinning instead of sleeping (so burn a CPU). This has improved the average value dramatically from ~146µs to ~3.1µs. A typical status line now looks like:

[READ: Count=947, Max=13, Min=0, Avg=0.02, 90=0, 99=0, 99.9=3, 99.99=13]

[Intended-READ: Count=948, Max=57, Min=0, Avg=0.47, 90=1, 99=12, 99.9=26, 99.99=57]

It is obviously not desirable for the load generator to burn a CPU instead of using sleep, but it also introduces scheduling inaccuracies. This is an accuracy issue we didn't have to deal with when not measuring from a schedule. This didn't impact the measured outliers, but has dramatically reduced their number. The take away here is just that there are accuracy limitations to the load generators ability to stick to schedule.

GC pauses that are large enough to derail the schedule on the load generator side are now captured. Unless the GC pauses happen inside the measurement gap we will have no idea we have gone off schedule if we don't track the intended start time.

We should capture GC logs on load generator side and make sure we correlate the GC events with recorded latencies. Here's a GC pause being captured by the corrected measurement:

[READ: Count=952, Max=0, Min=0, Avg=0, 90=0, 99=0, 99.9=0, 99.99=0]

[Intended-READ: Count=952, Max=14, Min=0, Avg=0.03, 90=0, 99=0, 99.9=3, 99.99=14]

[GC (Allocation Failure) [PSYoungGen: 17895K->1824K(18944K)] 17903K->1840K(62976K), 0.0024340 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]

Total time for which application threads were stopped: 0.0026392 seconds

[READ: Count=957, Max=0, Min=0, Avg=0, 90=0, 99=0, 99.9=0, 99.99=0]

[Intended-READ: Count=957, Max=2719, Min=0, Avg=5.21, 90=0, 99=0, 99.9=2119, 99.99=2719]

This process is running with a 64M heap, you can expect longer pauses as the heap grows (in particular as the young generation grows).

The operation setup time is now being measured as well as the operation itself.

When running with the spin option we can see the average operation cost is ~3.1µs, this is all test harness overhead and is really quite negligible in the context of network hopping operations. In other words, nothing to worry about for this harness but could well prove an issue for others.

Step 5: The Good, The Bad And The STW pausing DB

Many software processes have a latency profile that is far from normally distributed. To see what YCSB makes of this kind of profile now that we can compare corrected vs. uncorrected measurement I have built a mock DB that has 4 mods of latency (p is uniform random number [0,1]):

Awesome (p < 0.9): we return in 200µs-1ms
Good (0.9 < p < 0.99): we return in 1-10ms
Minor Hiccup( 0.99 < p < 0.9999): we hit a bump, but only one thread is affected 10-50ms
Major Hiccup(0.9999 < p): we hit a STW pause(because GC/THP/LBJ/STD/others), all threads stop for 50-200ms

I implemented the above with a read write lock, where the STW pause grabs the write lock and all the others grab the read lock. This is far from a perfect representation of a system (everyone waits for STW as intended, but also STW waits for everyone to start), but it will do. If you feel strongly that a better simulation is in order, write one and let's discuss in the comments!
What sort of profile will we see? How far off course will our measurements be if we don't stick to schedule? Here's this setup run at a rate of 10,000 requests per second, with 25 threads (so each thread is trying for 250 reqs/sec or 1 request per 4ms):
[READ], Operations, 569516.0
[READ], AverageLatency(us), 1652.1852871561116
[READ], MinLatency(us), 210.0
[READ], MaxLatency(us), 142463.0
[READ], 95thPercentileLatency(ms), 1.0
[READ], 99thPercentileLatency(ms), 19.0
[READ], Return=0, 569516

According to these numbers, the max is quite high but the overall impact of hiccups is not too severe (all depends on your requirements of course). Even at this stage we can see that the effect of global pauses is skewing the other measurements (if you hit a short operation while a STW pause is in progress you still have to wait for the STW event to finish).
The corrected measurements tell a different story:
[Intended-READ], Operations, 569516.0
[Intended-READ], AverageLatency(us), 24571.6025835973
[Intended-READ], MinLatency(us), 268.0
[Intended-READ], MaxLatency(us), 459519.0
[Intended-READ], 95thPercentileLatency(ms), 83.0
[Intended-READ], 99thPercentileLatency(ms), 210.0

How can this be right? Can this be right?

At a rate of 10000 request per second, the unlikely Major Hiccup is likely to happen every second. Consider this next time someone tells you of a 99.99%ile behaviour. Given an event rate of 10K per second, 99.99% is suddenly not very rare. Consider that at this rate there's likely to be a few events that are worse.
The average major hiccup is 125ms long, in this time 125/4 events are delayed on all 25 threads -> 125 * 25 / 4 = 781 events are delayed from starting, they will further delay each other as they execute. In roughly 12 seconds we can see how it is quite probable that one of these events is another major hiccup. What with all the queuing up behind the first one etc, the pileup becomes quite reasonable.
The probability of a 'mode' is not the probability of the per event latency once STW and queuing effects are in play.

I've made the mock DB print out 'OUCH' every time we get slapped with a STW event. It turns out that we got very unlucky in this run and hit three of these in a row:
56 sec:
[READ: Count=9192, Max=83903, Min=238, Avg=1745.13, 90=1531, 99=26959, 99.9=79551, 99.99=83775]
[Intended-READ: Count=9208, Max=159999, Min=303, Avg=16496.92, 90=54271, 99=103807, 99.9=150527, 99.99=158335]
OUCH
OUCH
OUCH
57 sec:
[READ: Count=9642, Max=129727, Min=247, Avg=2318, 90=1799, 99=40607, 99.9=125631, 99.99=127359]
[Intended-READ: Count=9635, Max=459519, Min=320, Avg=102971.39, 90=200319, 99=374271, 99.9=442367, 99.99=457983]

This is quite telling.
The view on what's the worst second in this run is wildly different here. Because the uncorrected measurement takes each event as it comes it will take the view that 75 events were delayed by these hiccups, and none by more than 130ms. But from the corrected measurement point of view all the queued up measurements were effected and were further delayed by each other.
I've re-run, this time logging interval histograms in their compressed form for every second in the run. Logging a 60 seconds run with 1 second interval data cost me 200k (we can tweak the construction in OneMeasurementHdrHistogram to minimize the cost). I can take the compressed logs and use the HistogramLogProcessor script provided with HdrHistogram to process the logs (you need to copy the HdrHistogram.jar into the script folder first). Running:

./HistogramLogProcessor -i READ.hdr -o uncorrected -outputValueUnitRatio 1000

./HistogramLogProcessor -i Intended-READ.hdr -o corrected -outputValueUnitRatio 1000

Will produce *.hgrm files for both. I then use the plotFiles.html to generate the following comparison:

They tell very different stories don't they.
The red line will have you thinking your system copes gracefully up to the 99%ile slowly degrading to 20ms, when measuring correctly however the system is shown to degrade very quickly with the 20ms line crossed as early as the median, and the 99%ile being 10 times the original measurement. The difference is even more pronounced when we look at one of those terrible seconds where we had back to back STW hiccups. I can use the HistogramLogProcessor script to produce partial summary histograms for the 3 seconds around that spike:

./HistogramLogProcessor -i Intended-READ.hdr -o correctedOuch3 -outputValueUnitRatio 1000 -start 1425637666.488 -end 1425637668.492

Similarly we can compare a good second with no STW pauses:

Summary

Coordinated Omission is a common problem in load generators (and other latency reporters), we had a look at fixing YCSB, an industry standard load generator:

Replaced the data structure used to capture latency with HdrHistogram: that is just generally useful and gives us better data to work with when examining the corrected measurement
Found scheduling code and introduced notion of operation start time.
Found measuring code and captured both operation cost (uncorrected measurement) and scheduled time latency (corrected measurement).
Use a mock system under test to evaluate measurement of known scenario. This is a very handy thing to have and luckily YCSB had this facility in place. In other places you may have to implement this yourself but it's a valuable tool to have in order to better understand the measurement capabilities of your harness. This helped highlight the scale of scheduling inaccuracies and test harness overhead per operation, as well as the scale of test harness error during its own warmup period.
Use HdrHistogram facilities to visualise and analyse latency histogram data from the compressed histogram logs.

Thanks goes to this posts kind reviewers: Peter Huges, Darach, and Philip Aston

Psychosomatic, Lobotomy, Saw

Wednesday, 11 March 2015

Correcting YCSB's Coordinated Omission problem