Wednesday, 22 April 2015

Porting Pitfalls: Turning D.Vyukov MPSC Wait-free queue into a j.u.Queue

{This post is part of a long running series on lock free queues, check out the full index to get more context here}
D. Vyukov is an awesome lock-free dude, and I often refer to his instructive and invaluable website 1024cores.net in my posts. On his site he covers lock free queue implementations, and in particular a wait-free MPSC linked node queue. This is really rather special when you consider that normally MP would imply lock-free rather than wait-free guarantees. I've ported his algorithm to Java (and so have many others: Netty/Akka/RxJava etc.), and had to tweak it to match the Queue interface. In this post I'd like to explain the algorithm, its translation to Java, and the implications of making it a j.u.Queue.

Lock free vs. Wait free

Let's review the definitions:
  • Wait-free: per-thread progress is guaranteed, every operation finishes in a bounded number of steps.
  • Lock-free: global progress is guaranteed, though a particular thread may be held up indefinitely.

An example of a transition from lock free to wait free is available with JDK8 changes to AtomicReference::getAndSet(). The change was made by utilizing the newly available Unsafe::getAndSetObject intrinsic which translates directly to XCHG (on x86). So where we used to have for AtomicReference:
T getAndSet(T newValue) {
  T currentValue;
  do {
    currentValue = val; // val is a volatile field
  } while (!UNSAFE.compareAndSwapObject(this, VAL_FIELD_OFFSET, currentValue, newValue));
  return currentValue;
}
Now we have:
T getAndSet(T newValue) {
  return (T) UNSAFE.getAndSetObject(this, VAL_FIELD_OFFSET, newValue);
}
I discussed a similar change to AtomicLong.getAndAdd in a previous post, replacing the CAS loop with LOCK XADD.

The Vyukov Wait Free MPSC queue

This is a LinkedList type structure and the interesting methods are offer and poll; the original (C-style) version is up on 1024cores.net. Awesome in its simplicity, deceptive in its genius. Be aware that the head/tail meaning is the other way around from what most people are used to. I personally go for producerNode/consumerNode (head = producer side, tail = consumer side in Vyukov's code) in my code, but for consistency I'll stick with Mr. V's notation for the porting exercise.
But how do we manage the same memory barriers in Java? We can be nice about it and use an AtomicReferenceFieldUpdater, or be more brutal and use Unsafe. I find you get better performance with Unsafe, but you should consider how appropriate this is for you. In any case, here's what we end up with:
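Here's a sketch of the port (not the verbatim JCTools code: it uses an AtomicReference for the producer end to keep things short, where the Unsafe version would use a field offset and getAndSetObject; field names follow Mr. V's head/tail notation):
import java.util.concurrent.atomic.AtomicReference;

public class MpscLinkedQueue<E> {
  static final class Node<E> {
    volatile Node<E> next;
    E value;
    Node(E value) { this.value = value; }
  }

  private final AtomicReference<Node<E>> head; // producer side
  private Node<E> tail;                        // consumer side, single threaded

  public MpscLinkedQueue() {
    Node<E> stub = new Node<>(null);
    head = new AtomicReference<>(stub);
    tail = stub;
  }

  // Wait-free: one XCHG per offer, no retry loop
  public boolean offer(E e) {
    if (e == null) throw new NullPointerException();
    Node<E> n = new Node<>(e);
    Node<E> prev = head.getAndSet(n); // swap head, the single serialization point
    prev.next = n;                    // link step, may be delayed if the thread stalls here
    return true;
  }

  // Vyukov-style poll: return null if the next node is not visible yet
  public E poll() {
    Node<E> next = tail.next;
    if (next == null) {
      return null;
    }
    E e = next.value;
    next.value = null;
    tail = next;
    return e;
  }
}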
The code is pretty similar. Now if we wanted to complete the j.u.Queue we could extend AbstractQueue and implement only size()/peek() and we're done (I'm not going to bother with the iterator()):
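Something along these lines (a naive sketch; as we'll see below, these are not quite right):
public E peek() {
  Node<E> next = tail.next;
  return next == null ? null : next.value;
}

public int size() {
  int size = 0;
  Node<E> n = tail.next;
  while (n != null) { // relies on next == null meaning 'no more elements'
    size++;
    n = n.next;
  }
  return size;
}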
There we go, seems reasonable don't it? Off to the pub to celebrate some kick-ass lock free programming!

Vyukov::poll() ain't Queue::poll!

I've discussed this particular annoyance in a previous post, which has some overlap with this one. The names are the same, but as it turns out the guarantees are quite different. While Vyukov is doing a fine job implementing a queue, not every queue is a j.u.Queue. In particular for this case, the poll() method has a different contract:
  • j.u.Queue: Retrieves and removes the head of this queue, or returns null if this queue is empty.
  • Vyukov: Retrieves and removes the head of this queue, or returns null if next element is not available.
Why wouldn't the next element be available? Doesn't that mean the queue is empty? Sadly, this ain't the case. Imagine for instance there are 2 producers and we run both threads step by step in a debugger, breaking inside offer on the line that swaps head (the getAndSet/XCHG):
  1. Producer 1: we step one line, so head has just been replaced with our node n. We are suspended before linking prev.next = n.
  2. Producer 2: we let the program continue. It has replaced head and linked its node to the previous head, which is Producer 1's node.
What state are we in? Let's look at head/tail nodes and where they lead (I number the nodes in order of assignment to head):
  • head = Node[2], this is the node created by Producer 2. We also know that Node[1].next = Node[2], because we let offer run its course on Producer 2.
  • tail = Node[0], the node we allocated in the constructor. This is what head was before the first producer came along, and it is what prev is equal to for Producer 1; but because we suspended that thread it never set its next value. Node[0].next is still null!

If a consumer came along now they would get a null from poll(), indicating the queue is empty. But the queue is obviously not empty!
So it seems we cannot deduce that the queue is empty by looking at tail.next. Here are 2 valid indicators that the queue is empty:
  • head == tail : this is the starting point set in the constructor and where the consumer ends up after consuming the last element
  • head.value == null : head can only have a null value if it is also the tail (the consumer nulls out node values as it consumes them)
Here's a solution for a correct poll() and the knock-on effect on peek():
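A sketch of the fix, in the same vein as the port above (JCTools does the equivalent, just with Unsafe; the spin loop is what downgrades the consumer to lock-free):
public E poll() {
  Node<E> next = tail.next;
  if (next == null && tail != head.get()) {
    // a producer has swapped head but not linked prev.next yet: wait for the link
    while ((next = tail.next) == null) { }
  }
  if (next == null) {
    return null; // genuinely empty: tail == head
  }
  E e = next.value;
  next.value = null;
  tail = next;
  return e;
}

public E peek() {
  Node<E> next = tail.next;
  if (next == null && tail != head.get()) {
    while ((next = tail.next) == null) { }
  }
  return next == null ? null : next.value;
}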
This is a bit annoying, because now we have a wait-free offer(), but poll() and peek() are only lock-free (the consumer can be held up spinning for a producer to complete the link, while producers can always make progress).
This pitfall is tricky enough that not only did I fall for it (on another queue algorithm, but the same mistake), it also took Mr. Manes by surprise while writing an interesting variant on this queue for his high performance cache implementation (I filed an issue which he promptly fixed), and it struck the notorious Dave Dice when considering Vyukov's MPMC queue (see comments for discussion).
So is this it? Are we done?
Almost... size() is still broken.

A Good Size

It's not really surprising that size is broken, given that the terminating condition for the size loop relied on next == null to terminate the count. Size was also broken in 2 other subtle ways:
  • The interface for size dictates that it returns an int. But given that the queue is unbounded it is possible (though very unlikely) for it to hold more than 2^31-1 elements. This would require the linked list to consume over 64GB (16b object header + 8b next reference + 8b value reference = 32b per element; references are 8b because holding this many elements requires more than a 32GB heap, so no compressed oops). Unlikely, but not impossible. This edge condition is handled for ConcurrentLinkedQueue (the same approach is taken here) and for LinkedBlockingQueue (by bounding its size to Integer.MAX_VALUE, so it's not really unbounded after all), but not for LinkedList (where size is maintained in an int and blindly incremented).
  • Size is chasing a moving target and as such can in theory never terminate as producers keep adding nodes. We should limit the scope of the measurement to the size of the queue at the time of the call. This is not done for CLQ and should perhaps be considered.
Here's the end result:
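A sketch of what that ends up looking like (roughly the approach JCTools takes: bound the count by the head observed on entry, and cap at Integer.MAX_VALUE):
public int size() {
  Node<E> chaserNode = tail;              // consumer end
  final Node<E> currentHead = head.get(); // snapshot the producer end at call time
  int size = 0;
  while (chaserNode != currentHead && size < Integer.MAX_VALUE) {
    Node<E> next;
    while ((next = chaserNode.next) == null) { } // producer mid-offer, wait for the link
    chaserNode = next;
    size++;
  }
  return size;
}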
If you need a full SPSC/MPSC linked queue implementation they are available in JCTools for your pleasure, enjoy!

Special thanks to Doug and Ben for reviewing!

Monday, 13 April 2015

On Arrays.fill, Intrinsics, SuperWord and SIMD instructions

{This post turned rather long, if you get lazy feel free to skip to the summary}
Let's start at the very beginning, a very good place to start... My very first post on this blog was a short rant on intrinsics, and how they ain't what they seem. In that post I made the following statement:
"intrinsic functions show up as normal methods or native methods"
Which is correct. An intrinsic function is applied as a method substitution: a method call will appear in the code and the compiler will replace its apparent source-level implementation with a pre-cooked implementation. In some cases intrinsics are sort of compilation cheats, the idea being that some bits of functionality are both very important (i.e. worthwhile optimizing) and can benefit from a hand crafted solution that will be better than what the compiler can achieve. The end result can be in one of a few flavours:
  1. Method call replaced with a call to a JVM runtime method: E.g. System.arraycopy is replaced with a call to a method stub generated by the runtime for all array types. This method call is not a JNI call, but it is a static method call that is not inlined.
  2. Method call replaced with one or more instructions inlined: E.g. Unsafe.getByte/compareAndSet/Math.max
  3. Method call replaced with compiler IR implementation: E.g. java.lang.reflect.Array.getLength
  4. A mix of the above: E.g. String.equals is partially implemented in IR, but the array comparison is a call to a method stub.
The intrinsics are all set up in vmSymbols.hpp and if you look, you'll see Arrays.fill is NOT on the list. So why am I talking about Chewbacca? Because it is something like an intrinsic...

The Arrays.fill SIMD Opportunity

Arrays.fill is the Java memset (it fills an array with a given value) and, just like System.arraycopy (memcpy in C lingo), it is worth the effort to optimize and offers the same kind of opportunity. What opportunity might that be, you ask? The opportunity to use SIMD (Single Instruction Multiple Data) instructions when the underlying CPU offers them (I assume for the sake of discussion AVX enabled CPUs, i.e. Sandy Bridge onwards; I find this listing of Intel intrinsics useful for explaining and sorting through the available instructions). These instructions allow the CPU to operate on chunks of data up to 256 bits wide (512 bits soon), thus replacing many single byte MOV instructions with one wide MOV instruction (e.g. the Intel C intrinsic _mm256_storeu_si256 or the corresponding instruction vmovdqu). SIMD instructions are good for all sorts of operations on vectors of data, or arrays, which is why the process of transforming element by element operations into SIMD instructions is also referred to as vectorization.
The actual assembly stub is generated dependent on CPU and available instruction set. For x86 the code is generated by the macroAssembler_x86.cpp, and the observant digger into the code will find it makes use of the widest memory instructions it can identify the processor is capable of. Wider is better baby! If you are not morbidly curious about what the implementation looks like, skip the next wall of assembly and you'll be back in Java land shortly.
Here's what the assembly boils down to when UseAVX>=2/UseSSE>=2/UseUnalignedLoadStores=true:

Roughly speaking the algorithm above is:
  1. Fill up an XMM register with the intended value
  2. Use the XMM register to write 64 byte chunks (2 vmovdqu) until no more are available
  3. Write leftover 32 byte chunk (skipped if no matching leftovers)
  4. Write leftover 8 byte chunks (skipped if no matching leftovers)
  5. Write leftover 4 bytes (skipped if no matching leftovers)
  6. Write leftover 2 bytes (skipped if no matching leftovers)
  7. Write leftover 1 byte (skipped if no matching leftover)
It ain't nice, but we do what we gotta for performance! There are variations of the above described across the internets as the done thing for a memset implementation, this might seem complex but is pretty standard... anyway, moving right along.

The Arrays.fill 'intrinsic'

Arrays.fill is different from System.arraycopy because, as its absence from vmSymbols suggests, it's not a method substitution kind of intrinsic (so technically not an intrinsic at all). What is it then? Arrays.fill is a code pattern substitution kind of compiler shortcut, basically looking for this kind of loop:
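Something like the following trivial loop (a sketch of the shape being matched, not taken from the JVM sources):
static void manualFill(byte[] array, byte value) {
  for (int i = 0; i < array.length; i++) {
    array[i] = value; // simple ascending fill of the whole array with a loop-invariant value
  }
}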
And replacing it with a call into the JVM memset implementation (I recently learnt the same thing is done by GCC as well, see code to assembly here). The pattern matching bit is done in loopTransform.cpp. This feels enough like an intrinsic grey area that the method doing the pattern match and replace is called intrinsify_fill.
Pattern matching makes this optimization potentially far more powerful than method substitution, as the programmer doesn't have to use a special JDK method to convey meaning; they can just express their meaning in code and the compiler 'knows' that this simple loop means 'fill'. Compare that with System.arraycopy, where rolling your own leads to performance that is much worse than that offered by the intrinsic.
Let's prove me right (my favourite thing, beats kittens and all that crap). Here's a JMH (see the JMH reference page for more JMH info/examples) benchmark comparing Arrays.fill to a hand rolled fill, and System.arraycopy to a hand rolled array copy:
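The benchmark is along these lines (a sketch; the method names match the results below, the sizes and fill values are assumptions):
import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ArrayFill {
  final byte[] bytes = new byte[32 * 1024];
  final byte[] copyTarget = new byte[32 * 1024];

  @Benchmark
  public void fillBytes() {
    Arrays.fill(bytes, (byte) 1);
  }

  @Benchmark
  public void manualFillBytes() {
    for (int i = 0; i < bytes.length; i++) {
      bytes[i] = 1;
    }
  }

  @Benchmark
  public void manualReversedFillBytes() {
    for (int i = bytes.length - 1; i >= 0; i--) {
      bytes[i] = 1;
    }
  }

  @Benchmark
  public void copyBytes() {
    System.arraycopy(bytes, 0, copyTarget, 0, bytes.length);
  }

  @Benchmark
  public void manualCopyBytes() {
    for (int i = 0; i < bytes.length; i++) {
      copyTarget[i] = bytes[i];
    }
  }
}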
And the results are (Oracle JDK8u40/i7-4770@3.40GHz/Ubuntu, array is 32K in size)?
ArrayFill.fillBytes                561.540 ± 10.814 ns/op
ArrayFill.manualFillBytes          557.901 ± 5.255  ns/op
ArrayFill.manualReversedFillBytes 1017.856 ± 0.425  ns/op
ArrayFill.copyBytes               1300.313 ± 13.482 ns/op
ArrayFill.manualCopyBytes         1477.442 ± 13.030 ns/op

We can verify that the call out to the JVM fill method happens for fillBytes/manualFillBytes by printing out the assembly:

So what have we learnt so far:
  • Use System.arraycopy, it is better than your hand rolled loop. But surprisingly not hugely better, hmmm.
  • You don't have to use Arrays.fill, you can roll your own and it works the same. Notice the call out to the fill method. But...
  • Don't get too creative rolling your own. If you get too funky (like filling the array backwards) it'll fall apart and the 'intrinsic' won't happen. But do note that the reverse fill still has some of that good SIMD stuff going, we'll get to that in a sec.

Are The Other Types Filling The Love?

It all sounds great don't it? Let's see how this pans out for other types. We'll be filling an array of 32KB. To be uniform across data types that means a 16K chars/shorts array, an 8K ints/floats array and a 4K array of longs. I added an 8K array of objects, which is the same size for compressed oops on the Oracle JVM (reference size is 4 bytes, same as an int).
The JMH benchmark code is as you'd expect:
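Again a sketch rather than the original source; these sit in the same class as the byte benchmarks above, with sizes chosen so every array is 32KB (floats/doubles follow the same pattern):
final char[] chars = new char[16 * 1024];
final short[] shorts = new short[16 * 1024];
final int[] ints = new int[8 * 1024];
final long[] longs = new long[4 * 1024];
final Object[] objects = new Object[8 * 1024];
final Object filler = new Object();

@Benchmark
public void fillChars() { Arrays.fill(chars, 'c'); }

@Benchmark
public void fillShorts() { Arrays.fill(shorts, (short) 1); }

@Benchmark
public void fillInts() { Arrays.fill(ints, 1); }

@Benchmark
public void fillLongs() { Arrays.fill(longs, 1L); }

@Benchmark
public void fillObjects() { Arrays.fill(objects, filler); }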
Here's some reasonable expectations:
  • If no optimizations are present, wider writes are more efficient. It follows that the longFill would be the fastest. But...
  • Given a clever compiler the fill loop is replaced with the widest writes possible, so there should be no significant difference. But the fill optimization does not cover double/long/object arrays, so we might expect longFill to be the worst performer.
  • An objects array is not that different from an int array, so performance should be similar. Sure there's a write barrier, but it need only be done once per card (not once for the whole array as I thought initially, god bless Shipilev and PrintAssembly), so that's an extra byte write per card of elements filled. A card is per 512 bytes, each element is 4 bytes, so that's one card per 128 elements. Given there is no fill method implemented for it we may expect it to be slightly worse than the longFill.
  • We should not rely on expectations, because performance is best measured.
As you'd expect the results are somewhat different than the expectations (Oracle JDK8u40/i7-4770@3.40GHz/Ubuntu):
ArrayFill.fillBytes     561.540 ± 10.814  ns/op
ArrayFill.fillChars     541.901 ±  4.833  ns/op
ArrayFill.fillShorts    532.936 ±  4.508  ns/op
ArrayFill.fillInts      543.165 ±  3.788  ns/op
ArrayFill.fillFloats    537.599 ±  2.323  ns/op
ArrayFill.fillLongs     326.770 ±  3.267  ns/op
ArrayFill.fillDoubles   346.840 ±  5.786  ns/op
ArrayFill.fillObjects  4388.242 ± 11.945  ns/op

Say WOT?
For bytes/chars/shorts/ints/floats Arrays.fill performs very similarly. This much is as expected from the second point above. But filling an array of longs/doubles is better than the others. The funny thing is, there's no fill function implemented for the long array, how come it is so darn quick? Also, why does the objects fill suck quite so badly when compared with the rest (I will not be addressing this last question! I refuse! this post is too fucking long as it is!)?
This is what happens when we turn off the OptimizeFill flag:
ArrayFill.fillBytes    1013.803 ± 0.227  ns/op
ArrayFill.fillChars     323.806 ± 3.879  ns/op
ArrayFill.fillShorts    323.689 ± 4.499  ns/op
ArrayFill.fillInts      326.336 ± 1.559  ns/op
ArrayFill.fillFloats    319.449 ± 2.048  ns/op
ArrayFill.fillLongs     328.692 ± 3.282  ns/op
ArrayFill.fillDoubles   345.035 ± 6.362  ns/op
ArrayFill.fillObjects  4397.130 ± 7.161  ns/op

Strange, innit? Now we've got char/int/long arrays all performing similarly. In fact, with the exception of the byte array, everything is better than it was with the optimization.


Superword to the rescue! 

Turns out the JIT compiler is clued up on the topic of SIMD parallelisation by way of Superword Level Parallelism (see the original paper here):
In some respects, superword level parallelism is a restricted form of ILP (Instruction Level Parallelism). ILP techniques have been very successful in the general purpose computing arena, partly because of their ability to find parallelism within basic blocks. In the same way that loop unrolling translates loop level parallelism into ILP, vector parallelism can be transformed into SLP. This realization allows for the parallelization of vectorizable loops using the same basic block analysis. As a result, our algorithm does not require any of the complicated loop transformations typically associated with vectorization. In fact, vector parallelism alone can be uncovered using a simplified version of the SLP compiler algorithm.
...
Superword level parallelism is defined as short SIMD parallelism in which the source and result operands of a SIMD operation are packed in a storage location.
...
Vector parallelism is a subset of superword level parallelism.
The Hotspot compiler implements SLP optimizations in superword.cpp and you are invited to dive into the implementation if you like. I'm going to focus on its impact here, and to do that I only need to know how to turn it on and off (a core competency for any software person). It's on by default, so the above results are what happens when it is on; here's what life looks like when it is off too (so -XX:-OptimizeFill -XX:-UseSuperWord):
ArrayFill.fillBytes   8501.270 ±  2.896  ns/op
ArrayFill.fillChars   4286.161 ±  4.935  ns/op
ArrayFill.fillShorts  4286.168 ±  3.146  ns/op
ArrayFill.fillInts    2152.489 ±  2.653  ns/op
ArrayFill.fillFloats  2140.649 ±  2.587  ns/op
ArrayFill.fillLongs   1105.926 ±  2.228  ns/op
ArrayFill.fillDoubles 1105.820 ±  2.393  ns/op
ArrayFill.fillObjects 4392.506 ± 11.678  ns/op


Life is revealed in all its sucky splendour! This is what happens when the compiler shows you no love... did I say no love? Hang on, things can get a bit worse.

Detour: Unsafe? We don't serve your kind here

To all the Unsafe fans, I got some sad news for y'all. Unsafe 'fill' loops are not well loved by the compiler. This is the price of stepping off the beaten path I guess. Consider the following benchmark:
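Something along these lines (a sketch in the same benchmark class; the benchmark names match the results below, the rest is assumed):
import sun.misc.Unsafe;

static final Unsafe UNSAFE;
static {
  try {
    java.lang.reflect.Field f = Unsafe.class.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    UNSAFE = (Unsafe) f.get(null);
  } catch (Exception e) {
    throw new AssertionError(e);
  }
}

final int SIZE = 32 * 1024;
final long offheapAddress = UNSAFE.allocateMemory(SIZE);
final byte[] onHeapBytes = new byte[SIZE];

@Benchmark
public void unsafeFillOffheapBytes() {
  for (int i = 0; i < SIZE; i++) {
    UNSAFE.putByte(offheapAddress + i, (byte) 1);
  }
}

@Benchmark
public void unsafeFillOnHeapBytes() {
  for (int i = 0; i < SIZE; i++) {
    UNSAFE.putByte(onHeapBytes, Unsafe.ARRAY_BYTE_BASE_OFFSET + i, (byte) 1);
  }
}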
The results are:
ArrayFill.unsafeFillOffheapBytes  9742.621 ±  2.270  ns/op
ArrayFill.unsafeFillOnHeapBytes  12640.019 ±  1.977  ns/op
ArrayFill.fillBytes(for reference) 561.540 ± 10.814 ns/op

The Unsafe variants do not enjoy the 'fill' pattern matching magic, nor do they get the SuperWord optimizations. What can you do? For this kind of thing you should use the Unsafe.setMemory method instead:
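For example (same sketch class as above):
@Benchmark
public void unsafeSetOffheapBytes() {
  UNSAFE.setMemory(offheapAddress, SIZE, (byte) 1);
}

@Benchmark
public void unsafeSetOnHeapBytes() {
  UNSAFE.setMemory(onHeapBytes, Unsafe.ARRAY_BYTE_BASE_OFFSET, SIZE, (byte) 1);
}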
With the result:
ArrayFill.unsafeSetOffheapBytes   1259.281 ± 21.294  ns/op
ArrayFill.unsafeSetOnHeapBytes    1275.158 ± 27.950  ns/op
Not quite there, still ~2x worse (why? How come it doesn't just call the byte fill method? A bit of digging shows it ends up calling the underlying platform's memset...), but that beats being 20-25x worse like the hand rolled loop is.

Summary and Musings

It's the circle of life!
So what did we learn:
  • There's another kind of 'intrinsic' like optimization, which uses pattern matching to swap a block of code rather than a method call. This is employed for the intrinsification of memset-like memory fill loops (in particular Arrays.fill). It's not an intrinsic technically, but you know what I fucking mean.
  • System.arraycopy/Arrays.fill implementations utilize SIMD instructions to improve their efficiency. These instructions are not available in plain Java, so some compiler intervention is required.
  • The JIT compiler is also able to use SuperWord Level Parallelism to derive SIMD code from 'normal' sequential code.
  • In the case of Arrays.fill, it looks like the SuperWord optimized code is faster than the specialized fill implementation for all types except bytes (on the system under test).
  • If you use Unsafe you will be excluded from these optimizations.
So I look at this process and I imagine history went something like this:
We want to use SIMD instructions, but the JIT compiler isn't really clever enough to generate them by itself. Memset implementations are rather specialized after all. Let's make life a bit easier for the compiler by creating an intrinsic. We'll even go the extra mile and make an effort to automatically identify opportunities to use this intrinsic, so now it's not really an intrinsic any more. The Arrays.fill optimization is available on Oracle JDK6u45 (the oldest I keep around, maybe it was there a while before that) and on that JVM it is twice as fast as the SLP generated code.
Over time, SLP gets better and eventually the compiler is now good enough to optimize the fill loop by itself and beat the specialized method. That is an awesome thing. We just need to remove the training wheels now.
And there's a final punch line to this story. Memset/memcpy are such common and important opportunities for optimization that Intel has decided to offer an assembly 'recipe' for them and save everyone the effort of writing them:
3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB)
Beginning with processors based on Intel microarchitecture code name Ivy Bridge, REP string operation using MOVSB and STOSB can provide both flexible and high-performance REP string operations for software in common situations like memory copy and set operations. Processors that provide enhanced MOVSB/STOSB operations are enumerated by the CPUID feature flag: CPUID:(EAX=7H, ECX=0H):EBX.ERMSB[bit 9] = 1. - [From the Intel Optimization Manual (September 2014)]
From the manual it seems that this method of implementing memcpy/memset can perform well, but like anything else, YMMV (the Intel manual's discussion of the performance differences is in itself interesting, both on the results and the methodology level). One obvious advantage of this method is that it results in much, much smaller code that should be trivial to inline into callers. This will however put the SuperWord method at a slight disadvantage, and the tide will change again.
[UPDATE 14/03/2015: It seems the good folks of Oracle have considered and rejected the use of REP MOVSB for array copy.]
Thanks go to the kind reviewers Peter 'Massive' Hughes, Darach and the Shipster

Wednesday, 11 March 2015

Correcting YCSB's Coordinated Omission problem

The Nimbus Cloud Serving Board
YCSB is the Yahoo Cloud Serving Benchmark (also on wiki): a generic set of benchmarks setting out to compare different key-value store providers under a set of loads:
The goal of the Yahoo Cloud Serving Benchmark (YCSB) project is to develop a framework and common set of workloads for evaluating the performance of different "key-value" and "cloud" serving stores.
The code is open for extension and contribution and all that good stuff; you can get it here. And it has become a tool for comparing vendors in the NoSQL space. The benchmarks set out to measure latency and throughput. The terms are not directly defined in the paper, but the following statement is made:
The Performance tier of the benchmark focuses on the latency of requests when the database is under load. Latency is very important in serving systems, since there is usually an impatient human waiting for a web page to load. [...] Typically application designers must decide on an acceptable latency, and provision enough servers to achieve the desired throughput while preserving acceptable latency. [...] The YCSB Client allows the user to define the offered throughput as a command line parameter, and reports the resulting latency, making it straightforward to produce latency versus throughput curves.
What could possibly go wrong?™
It can go this wrong, for instance: an order of magnitude difference in results for different percentiles, leading to some poor decision making on how much hardware you'll need, leading to getting fired from your job and growing old bitter and twisted, mumbling to yourself as you get drunk on the street corner until you freeze to death on a winter night. So potentially this is a risk to your future well being, listen up!


It's broken? Coordinated WOT?

When you measure latency bad,
Mr. Tene is sad
 :(
My colleague at Azul, Gil Tene, the magnificent, glorious, multi-dimensional, coding CTO, officer and gentleman (that's my bonus sorted) has been doing a lot of latency related preaching and teaching in the last few years. He has given the following talks at any number of conferences, but if you happened to have missed them, watch them NOW:
In particular he has coined the term "Coordinated Omission" (see raging discussion on Mechanical Sympathy) to describe instances in which the measuring/monitoring system coordinates measurement with the system under test/measurement such that samples are biased. This issue manifests in many load generating frameworks where the call into the system under test is done synchronously and the measurement thread holds up the next sample while the call is ongoing. This enables the system under test to delay requests that would have been made during the synchronous call thus skewing the sample set. Consider for example a system where:
  • We set out to measure a request every 1ms (from a single thread, synchronously)
  • The first 500 calls come back in 100µs each (so call K starts at Kms and returns at Kms + 100µs )
  • Call 501 takes 500 milliseconds (starts at 500ms, returns at 1 second)
  • Call 502 takes 100µs
See the problem?
The problem is that call 502 did NOT happen at its designated time, and saying it took 100µs fails to capture this. It breaks the assumption laid out in the first bullet (a request every 1ms), because we were blocked for 500ms. If we were to stick to our original schedule we would be making calls 502 to 1000 in the time it took for call 501 to execute. How should we treat this departure from plan?
Is it safe to come out yet?
  1. Ignore it! it will go away by itself! - This is the coordinated omission way. We are now reporting numbers that are no longer according to the test plan, which means that our "latency versus throughput curves" are off the mark. This is a very common solution to this issue.
  2. Fail the test, we wanted a call every ms and we didn't get that - This is an honest hardline answer, but it potentially throws the baby out with the bath water. I think that if you set out to schedule 1000 calls per second you might want to see how often this falls apart and how. But this answer is The Truth™, can you handle it? If one is to start from scratch and write their own load generator I propose a read of the Iago load test framework philosophy page: "Iago accurately replicates production traffic. It models open systems, systems which receive requests independent of their ability to service them. Typical load generators measure the time it takes for M threads to make N requests, waiting for a response to each request before sending the next; if your system slows down under load, these load testers thus mercifully slow down their pace to match. That's a fine thing to measure; many systems behave this way. But maybe your service isn't such a system; maybe it's exposed on the internet. Maybe you want to know how your system behaves when N requests per second come in with no "mercy" if it slows down.". This is a fine sentiment.
  3. Coordinated Omission correction: Adjust the results to reflect the expected call rate. This can be done in a straightforward manner if the 'missing' calls are added back with a latency which reflects the period for which they were delayed. This correction method is supported out of the box by HdrHistogram, but the discussion regarding its over- or under-estimation of the impact of the delay is outside the scope of this post.
  4. Coordinated Omission avoidance: Measure all calls according to original schedule. We are now saying: "If I can't make the call, the meter is still running!". This is particularly relevant for systems where you would typically be making the requests to the system under test from a thread pool. That thread pool would be there to help you support asynchronous interaction where the API failed in giving you that option. Like JDBC... Like many key-value pair provider APIs.
This last solution is the one we'll go for in this post, but I would urge you to consider the results critically. In particular, if you are trying to simulate independent access to a web server (as opposed to a DB accessed via a thread pool) then adherence to schedule might make the results hugely optimistic in your case. This is because failing to generate independent load may have all sorts of beneficial effects on the system under test.
For the YCSB benchmark I'm assuming the harness/load generator is simulating a web serving layer accessing the key-value store in an effort to serve an unbounded, uncoordinated user request load via a predefined thread pool. So it's door number 4 for me. The corrected load generator is here.


Step 0: Some preliminary work (not strictly required)

HdrHistogram, as approved by Freddie!
As described previously here, we should all just get on with capturing latency using HdrHistograms. So as a first step toward correcting YCSB I have gone in and added an HdrHistogram measurement container. This is pretty straightforward, as all I needed to modify was the Measurements class to allow a new measurement type. While I was there I tweaked this and that, and the following list of changes to that class emerged:
  1. Add new measurement type and corresponding command-line option("-p measurementtype=hdrhistogram")
  2. Add combined measurement option allowing old/new measurement side by side: "hdrhistogram+histogram"
  3. Add support for capturing both corrected and uncorrected measurements for the same run.
  4. Use a ConcurrentHashMap (CHM) instead of synchronizing around a HashMap.
The new measurement type supports logging lossless HdrHistogram data to a file (controlled by the hdrhistogram.fileoutput=<true|false> option and the hdrhistogram.output.path=<path> option) as well as better precision percentile data and lock free logging of latencies. This is not very interesting work, but if you are interested in the "How would I plug in my own data structure to capture latency into YCSB?" topic, have fun. It was not necessary for correction but it was good to do so that better quality results can be observed. You're welcome.


Step 1: Demonstrate the issue

YCSB includes a very useful means of verifying the measurements in the form of a mock DB driver. This means we can test our assertions regarding coordinated omission without setting up a key value store of any kind. The mock DB is called BasicDB and is the default DB used. We can configure it to simulate a pause and see what happens (-p basicdb.verbose=false -p basicdb.simulatedelay=4  will make the mock DB stop logging every action and simulate a latency of 0-4ms for each action). I added a further option to the BasicDB which allows us to turn off the randomization of the delay (-p basicdb.randomizedelay=false).
Let's consider our expectations in the case where a DB simply cannot handle requests quickly enough. We can set up an experiment with the following settings: -target 1000 -threads 1 -s -p status.interval=1 -p workload=com.yahoo.ycsb.workloads.CoreWorkload -p basicdb.verbose=false -p basicdb.simulatedelay=4 -p basicdb.randomizedelay=false -p measurementtype=hdrhistogram -p maxexecutiontime=60
Here's what they all mean:
  • -target 1000 -> We aim to test 1000 requests per second
  • -threads 1 -> We have a single client thread
  • -s -p status.interval=1 -> We will be printing out status every second (I made the status interval configurable)
  • -p basicdb.verbose=false -p basicdb.simulatedelay=4 -p basicdb.randomizedelay=false -> The DB will sleep 4ms on each request, so the maximum we can hope for is 250, no noisy printing per operation please
  • -p measurementtype=hdrhistogram -> Use HdrHistogram to capture the latencies
  • -p maxexecutiontime=60 -> Run for one minute, then exit and print summary
This DB is obviously failing, it can't keep up with the rate of incoming requests and according to our model they queue up. The time measured per call is reflected in the summary for the READ operations:
[READ], Operations, 12528.0
[READ], AverageLatency(us), 4477.102809706258
[READ], MinLatency(us), 4018.0
[READ], MaxLatency(us), 44703.0
[READ], 95thPercentileLatency(ms), 4.0
[READ], 99thPercentileLatency(ms), 4.0
[READ], Return=0, 12528
But this completely ignores the time spent on the queue. If we were measuring according to schedule we'd get the following set of latencies:
Latency[k] = 4 + 3*(k-1) ms
The max latency would be for the last request to get in. We ran for 60 seconds, at 250 requests/sec which means our last request was (k=15000) and had a latency of 45 seconds when measured from the time we intended to make it. This number reflects the system's failure to handle load far more correctly than the numbers quoted above.


Step 2: Working to Schedule

The YCSB load generator has a weak notion of schedule, in the sense that it opts for option number 1 above and will just execute the operations when it can. When faced with the task of correcting this kind of issue (in a pile of foreign code) we need to look for 2 things in the load generator:
  1. "Scheduling an action to run at time X" - This will involve some calls to one of the many scheduling facilities in the JDK:
    1. Thread.sleep is an old favourite, but TimeUnit also supports a sleep method. A search for sleep in the code base will cover both. This is what YCSB was using to schedule the next event to fire.
    2. Code submitting tasks to java.util.Timer, or alternatively the ScheduledExecutorService
    3. Code using LockSupport.parkNanos
    4. Object.wait(...)
    5. others?
  2. "Measuring the length of an operation" - This will involve calls to System.nanoTime() or currentTimeMillis(). For YCSB this is found to happen for example here.

To correct this problem I had to introduce the concept of 'intended start time' to the operations measurement. The schedule for YCSB is specified by the -target command line option, which sets the overall number of operations per second to be attempted by the load generator. This is optional, and the default is to go as fast as you can manage, i.e. with no schedule but the back pressure from the system under test to guide us. I'm not sure what assumed request rate would be reasonable in that case, so I did not correct it. NOTE: If you don't specify a target, no correction will take place.
The target parameter is translated to a per-thread operation rate (number of threads is set via the threads option, default is 1) so if we have 10 threads, and the target request rate is 1000 (-target 1000 -threads 10) we will have each thread hitting the store with 100 requests per second. The client threads randomize the first operation time to avoid all hitting the store on the same interval. I did some ground work here by setting the units across the board to nanoseconds and naming interval parameters appropriately, nothing too exciting.
The actual correction, at its core, involves:
  1. Record the operation's intended start time
  2. Use the intended start time when computing latency
Sadly the way YCSB measures latency does not lend itself to a simple in-place fix. The operations are scheduled by the ClientThread, which calls into a Workload, which calls into a DB; the DB is actually the DBWrapper, which measures the latency of calling into an actual DB implementation and reports it to the central Measurements singleton. This means that changing the Workload/DB API to include a startTime parameter on each call is quite a far reaching change which would require me to dig through all the DB driver implementations and would result in a very unpleasant time for all.
I settled on using a thread local on the Measurements object to transfer the start time to the DBWrapper. It is not a nice way to do things (and I'm happy to hear better suggestions) but it does the job without modifying the API.
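A minimal sketch of that hand-off (the names here are illustrative, not necessarily those used in the actual patch):
// ClientThread knows the schedule, DBWrapper does the measuring; a thread local on
// Measurements carries the intended start time between the two without an API change.
public class Measurements {
  private static final ThreadLocal<long[]> INTENDED_START_NS =
      ThreadLocal.withInitial(() -> new long[1]);

  public void setIntendedStartTimeNs(long timeNs) {
    INTENDED_START_NS.get()[0] = timeNs;
  }

  public long getIntendedStartTimeNs() {
    return INTENDED_START_NS.get()[0];
  }
  // ... existing measure(...) machinery unchanged ...
}

// In DBWrapper, roughly:
// long intendedStartNs = measurements.getIntendedStartTimeNs();
// long startNs = System.nanoTime();
// int res = db.read(table, key, fields, result);
// long endNs = System.nanoTime();
// measure("READ", (int) ((endNs - startNs) / 1000));                  // operation cost
// measure("Intended-READ", (int) ((endNs - intendedStartNs) / 1000)); // latency vs. schedule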
Once we have:
  1. ClientThread setting up the start time for the operation via Measurements
  2. DBWrapper using the start time from Measurements to compute the operation latency
That's pretty much it. For extra points I wanted to include some facilities to compare measurements before/after the change. These can be removed if we accept HdrHistogram as a replacement and if we accept we only want to measure the intended latency, which would result in a much smaller PR.


Step 3: is the issue solved?

Running the setup from step 1 such that it produces the intended latency as well as the original measurement side by side (-p measurement.interval=both) yields the following result for the READ operations:
[READ], Operations, 12414.0
[READ], AverageLatency(us), 4524.981069759949
[READ], MinLatency(us), 4018.0
[READ], MaxLatency(us), 24703.0
[READ], 95thPercentileLatency(ms), 4.0
[READ], 99thPercentileLatency(ms), 4.0
[READ], Return=0, 12414
[Intended-READ], Operations, 12414.0
[Intended-READ], AverageLatency(us), 2.359010991606251E7
[Intended-READ], MinLatency(us), 4256.0
[Intended-READ], MaxLatency(us), 4.6989311E7
[Intended-READ], 95thPercentileLatency(ms), 42369.0
[Intended-READ], 99thPercentileLatency(ms), 46530.0

This reflects the effect a backed up system would have on latency, as discussed in Step 1 above. It's actually a bit worse, because the average cost of calling the mock DB with a sleep of 4ms is 4.5ms. As we can see, the maximum latency is 46.9 seconds, reflecting the fact that the last read to execute was scheduled to hit the system 13.1 seconds into the run.


Step 4: The limitations of the harness

We can now also consider the perfect DB, for the sake of observing the shortcomings of the test harness, by setting the mock DB delay to 0 (-p basicdb.simulatedelay=0):
[READ], Operations, 56935.0
[READ], AverageLatency(us), 0.01796785808377975
[READ], MinLatency(us), 0.0
[READ], MaxLatency(us), 49.0
[READ], 95thPercentileLatency(ms), 0.0
[READ], 99thPercentileLatency(ms), 0.0
[READ], Return=0, 56935
[Intended-READ], Operations, 56935.0
[Intended-READ], AverageLatency(us), 232.37026433652412
[Intended-READ], MinLatency(us), 0.0
[Intended-READ], MaxLatency(us), 39007.0
[Intended-READ], 95thPercentileLatency(ms), 0.0
[Intended-READ], 99thPercentileLatency(ms), 0.0

How come it takes so long to measure a noop? Why such large differences? Here are some generic theories and how they panned out:
  • The JVM running the load generator is running with suboptimal settings(-Xms64m -Xmx64m -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime, Oracle JDK8u31) on a busy Mac laptop running on battery
This is no way to benchmark anything, but the interesting thing is that if we have no schedule to stick to the test harness is willing to just ignore the issue. If we run on a decent machine (with a decent OS) we get nicer results. This is from a server class machine running CentOS6.3/OracleJDK8u25 with same settings:
[READ], Operations, 56930.0
[READ], AverageLatency(us), 0.44417705954681186
[READ], MinLatency(us), 0.0
[READ], MaxLatency(us), 20.0
[READ], 95thPercentileLatency(ms), 0.0
[READ], 99thPercentileLatency(ms), 0.0
[READ], Return=0, 56930
[Intended-READ], Operations, 56930.0
[Intended-READ], AverageLatency(us), 146.31262954505533
[Intended-READ], MinLatency(us), 15.0
[Intended-READ], MaxLatency(us), 14255.0
[Intended-READ], 95thPercentileLatency(ms), 0.0
[Intended-READ], 99thPercentileLatency(ms), 0.0
This is still significant.

  • The JVM suffers from warmup related artefacts
This certainly correlated to the max values I'm seeing here. When looking at the status line for the first second I see:
[READ: Count=22, Max=14, Min=0, Avg=0.64, 90=0, 99=14, 99.9=14, 99.99=14]
[Intended-READ: Count=23, Max=14255, Min=15, Avg=5563.39, 90=13287, 99=14255, 99.9=14255, 99.99=14255]
 But after a few seconds the process settles and we see much better results, this is typical:
[READ: Count=947, Max=14, Min=0, Avg=0.02, 90=0, 99=0, 99.9=2, 99.99=14]
[Intended-READ: Count=946, Max=194, Min=61, Avg=151.29, 90=165, 99=175, 99.9=186, 99.99=194]
A good way to handle this issue is to rely on the HdrHistogram output files to grab relevant time periods for analysis. With the original YCSB output we have the percentile summary data, but this is not something we can combine for analysis. With the lossless interval histogram logs we can look at any sub-period (longer than one interval, but shorter than the whole run) and get accurate full range histogram data. A common practice is to discard warmup period results; I'm not a fan of throwing away data, but since this is the load generator warmup I'd think it's quite legitimate. It's perhaps an interesting feature to add to such a framework that the framework can be warmed up separately from the system under test, to examine cold system behaviour.
  • Thread.sleep/LockSupport.parkNanos are not super accurate and may wake up after the intended operation start time
I've added an option for spinning instead of sleeping (so burn a CPU). This has improved the average value dramatically from ~146µs to ~3.1µs. A typical status line now looks like:
[READ: Count=947, Max=13, Min=0, Avg=0.02, 90=0, 99=0, 99.9=3, 99.99=13]
[Intended-READ: Count=948, Max=57, Min=0, Avg=0.47, 90=1, 99=12, 99.9=26, 99.99=57]
It is obviously not desirable for the load generator to burn a CPU instead of sleeping, but sleeping introduces scheduling inaccuracies. This is an accuracy issue we didn't have to deal with when not measuring from a schedule. Spinning didn't change the magnitude of the measured outliers, but it dramatically reduced their number. The takeaway here is just that there are accuracy limitations to the load generator's ability to stick to schedule.

  • GC pauses that are large enough to derail the schedule on the load generator side are now captured. If we don't track the intended start time, we have no idea we've gone off schedule unless the GC pause happens to land inside a measured operation.
We should capture GC logs on load generator side and make sure we correlate the GC events with recorded latencies. Here's a GC pause being captured by the corrected measurement:
[READ: Count=952, Max=0, Min=0, Avg=0, 90=0, 99=0, 99.9=0, 99.99=0] 
[Intended-READ: Count=952, Max=14, Min=0, Avg=0.03, 90=0, 99=0, 99.9=3, 99.99=14]
[GC (Allocation Failure) [PSYoungGen: 17895K->1824K(18944K)] 17903K->1840K(62976K), 0.0024340 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
Total time for which application threads were stopped: 0.0026392 seconds
[READ: Count=957, Max=0, Min=0, Avg=0, 90=0, 99=0, 99.9=0, 99.99=0]
[Intended-READ: Count=957, Max=2719, Min=0, Avg=5.21, 90=0, 99=0, 99.9=2119, 99.99=2719]
This process is running with a 64M heap, you can expect longer pauses as the heap grows (in particular as the young generation grows).

  • The operation setup time is now being measured as well as the operation itself.
When running with the spin option we can see the average operation cost is ~3.1µs, this is all test harness overhead and is really quite negligible in the context of network hopping operations. In other words, nothing to worry about for this harness but could well prove an issue for others.


Step 5: The Good, The Bad And The STW pausing DB

Many software processes have a latency profile that is far from normally distributed. To see what YCSB makes of this kind of profile, now that we can compare corrected vs. uncorrected measurements, I have built a mock DB that has 4 modes of latency (p is a uniform random number in [0,1]):
  1. Awesome (p < 0.9): we return in 200µs-1ms
  2. Good (0.9 < p < 0.99): we return in 1-10ms
  3. Minor Hiccup (0.99 < p < 0.9999): we hit a bump, but only one thread is affected: 10-50ms
  4. Major Hiccup (0.9999 < p): we hit a STW pause (because GC/THP/LBJ/STD/others), all threads stop for 50-200ms
I implemented the above with a read-write lock, where the STW pause grabs the write lock and all the other modes grab the read lock. This is far from a perfect representation of a system (everyone waits for the STW as intended, but the STW also waits for everyone to start), but it will do. If you feel strongly that a better simulation is in order, write one and let's discuss in the comments!
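A sketch of that mock DB delay logic (the numbers follow the 4 modes above; this is illustrative rather than the exact code used for the runs below):
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class HiccupDB {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();

  public void read() throws InterruptedException {
    double p = ThreadLocalRandom.current().nextDouble();
    if (p < 0.9999) {
      lock.readLock().lock(); // normal operations share the read lock
      try {
        sleepMicros(delayMicros(p));
      } finally {
        lock.readLock().unlock();
      }
    } else {
      lock.writeLock().lock(); // 'STW' major hiccup: blocks all other threads
      try {
        sleepMicros(50_000 + ThreadLocalRandom.current().nextInt(150_000));
      } finally {
        lock.writeLock().unlock();
      }
    }
  }

  private static long delayMicros(double p) {
    ThreadLocalRandom r = ThreadLocalRandom.current();
    if (p < 0.9)  return 200 + r.nextInt(800);      // awesome: 200µs-1ms
    if (p < 0.99) return 1_000 + r.nextInt(9_000);  // good: 1-10ms
    return 10_000 + r.nextInt(40_000);              // minor hiccup: 10-50ms
  }

  private static void sleepMicros(long micros) throws InterruptedException {
    Thread.sleep(micros / 1000, (int) ((micros % 1000) * 1000));
  }
}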
What sort of profile will we see? How far off course will our measurements be if we don't stick to schedule? Here's this setup run at a rate of 10,000 requests per second, with 25 threads (so each thread is trying for 400 reqs/sec, or 1 request every 2.5ms):
[READ], Operations, 569516.0
[READ], AverageLatency(us), 1652.1852871561116
[READ], MinLatency(us), 210.0
[READ], MaxLatency(us), 142463.0
[READ], 95thPercentileLatency(ms), 1.0
[READ], 99thPercentileLatency(ms), 19.0
[READ], Return=0, 569516

According to these numbers, the max is quite high but the overall impact of hiccups is not too severe (all depends on your requirements of course). Even at this stage we can see that the effect of global pauses is skewing the other measurements (if you hit a short operation while a STW pause is in progress you still have to wait for the STW event to finish).
The corrected measurements tell a different story:
[Intended-READ], Operations, 569516.0
[Intended-READ], AverageLatency(us), 24571.6025835973
[Intended-READ], MinLatency(us), 268.0
[Intended-READ], MaxLatency(us), 459519.0
[Intended-READ], 95thPercentileLatency(ms), 83.0
[Intended-READ], 99thPercentileLatency(ms), 210.0


How can this be right? Can this be right?
  • At a rate of 10,000 requests per second, the unlikely Major Hiccup is likely to happen every second. Consider this next time someone tells you about 99.99%ile behaviour. Given an event rate of 10K per second, 99.99% is suddenly not very rare. Consider that at this rate there are likely to be a few events that are worse.
  • The average major hiccup is 125ms long; in that time 125/2.5 = 50 events are delayed on each of the 25 threads -> 50 * 25 = 1250 events are delayed from starting, and they will further delay each other as they execute. In roughly 12 seconds we can see how it is quite probable that one of these events is another major hiccup. What with all the queuing up behind the first one etc., the pileup becomes quite reasonable.
  • The probability of a latency 'mode' is not the probability of observing that latency per event once STW and queuing effects are in play.
I've made the mock DB print out 'OUCH' every time we get slapped with a STW event. It turns out that we got very unlucky in this run and hit three of these in a row:
56 sec:
[READ: Count=9192, Max=83903, Min=238, Avg=1745.13, 90=1531, 99=26959, 99.9=79551, 99.99=83775]
[Intended-READ: Count=9208, Max=159999, Min=303, Avg=16496.92, 90=54271, 99=103807, 99.9=150527, 99.99=158335]
OUCH
OUCH
OUCH
57 sec: 
[READ: Count=9642, Max=129727, Min=247, Avg=2318, 90=1799, 99=40607, 99.9=125631, 99.99=127359]  
[Intended-READ: Count=9635, Max=459519, Min=320, Avg=102971.39, 90=200319, 99=374271, 99.9=442367, 99.99=457983]

This is quite telling.
The view on what the worst second in this run was is wildly different here. Because the uncorrected measurement takes each event as it comes, it will take the view that 75 events were delayed by these hiccups, and none by more than 130ms. But from the corrected measurement point of view all the queued up measurements were affected, and they were further delayed by each other.
I've re-run, this time logging interval histograms in their compressed form for every second of the run. Logging a 60 second run with 1 second interval data cost me 200KB (we can tweak the construction in OneMeasurementHdrHistogram to minimize the cost). I can take the compressed logs and use the HistogramLogProcessor script provided with HdrHistogram to process them (you need to copy the HdrHistogram.jar into the script folder first). Running:
./HistogramLogProcessor -i READ.hdr -o uncorrected -outputValueUnitRatio 1000
./HistogramLogProcessor -i Intended-READ.hdr -o corrected -outputValueUnitRatio 1000
This will produce *.hgrm files for both. I then use plotFiles.html to generate the following comparison:


They tell very different stories don't they.
The red line will have you thinking your system copes gracefully up to the 99%ile, slowly degrading to 20ms. When measuring correctly, however, the system is shown to degrade very quickly, with the 20ms line crossed as early as the median and the 99%ile being 10 times the original measurement. The difference is even more pronounced when we look at one of those terrible seconds where we had back to back STW hiccups. I can use the HistogramLogProcessor script to produce partial summary histograms for the 3 seconds around that spike:
./HistogramLogProcessor -i Intended-READ.hdr -o correctedOuch3 -outputValueUnitRatio 1000 -start 1425637666.488 -end 1425637668.492
Similarly we can compare a good second with no STW pauses:


Summary

Coordinated Omission is a common problem in load generators (and other latency reporters). We had a look at fixing YCSB, an industry standard load generator:
  • Replaced the data structure used to capture latency with HdrHistogram: that is just generally useful and gives us better data to work with when examining the corrected measurement
  • Found scheduling code and introduced notion of operation start time.
  • Found measuring code and captured both operation cost (uncorrected measurement) and scheduled time latency (corrected measurement).
  • Use a mock system under test to evaluate measurement of known scenario. This is a very handy thing to have and luckily YCSB had this facility in place. In other places you may have to implement this yourself but it's a valuable tool to have in order to better understand the measurement capabilities of your harness. This helped highlight the scale of scheduling inaccuracies and test harness overhead per operation, as well as the scale of test harness error during its own warmup period.
  • Use HdrHistogram facilities to visualise and analyse latency histogram data from the compressed histogram logs.
Thanks go to this post's kind reviewers: Peter Hughes, Darach, and Philip Aston


Monday, 16 February 2015

HdrHistogram: A better latency capture method

Some years back I was working on a latency sensitive application, and since latency was important it was a requirement that we somehow capture latency both at the transaction/event level and in summary form. The event level latency was post processed from the audit logs we had to produce in any case, but the summary form was used for live system monitoring. We ended up with a solution I've seen since in many places (mild variations on a theme), which is what I've come to think of as the linear buckets histogram:
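Something in this spirit (a sketch of the general shape, not the actual production code):
public class LinearHistogram {
  private final long[] buckets;
  private final long bucketSizeNs;
  private long overflowCount; // anything beyond the tracked range lands here

  public LinearHistogram(long maxTrackedNs, long bucketSizeNs) {
    this.bucketSizeNs = bucketSizeNs;
    this.buckets = new long[(int) (maxTrackedNs / bucketSizeNs)];
  }

  public void record(long latencyNs) {
    int index = (int) (latencyNs / bucketSizeNs);
    if (index < buckets.length) {
      buckets[index]++;
    } else {
      overflowCount++;
    }
  }
}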

The above data structure was an easy solution to a problem we had little time to solve, but it left much to be desired.
These days the histogram problem is gloriously solved by the HdrHistogram (High Dynamic Range), and though it's been around for a couple of years now I still find olden hand rolled histograms in many a code base. Cut that shit out boys and girls! Let me show you a better way :-)

HdrHistogram highlights:
  • Mature and battle tested, this data structure has been in production for many companies for a while now. Kinks have been unkinked and knickers untwisted.
  • Multi-lingual support: current implementations available in Java, C, C#, Erlang, Go and more are on the way.
  • Auto-resizing histograms (if you exceed your initial maximum value estimate)
  • Compact memory footprint supporting high precision of values across a wide range.
  • Compressed lossless serialization/de-serialization
  • Plotting scripts for gnuplot, a webby charting tool and an excel sheet chart
  • Lock-free concurrency support for recording and logging from multiple threads
  • Zero allocation on recording path (unless resizing which is optional, and then only if value exceeds initially specified max recorded value)
  • Constant time measurement which costs less than a call to System.nanoTime() (on the cost, scalability and trustworthiness of nanoTime read Shipilev's excellent report)
It is truly as if the bees-knees and the dogs-bollocks had a baby, is it not?

Mama, what's a histogram?

"Histogram of arrivals per minute" by DanielPenfield
Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons
Well, you could look it up on Wikipedia, but the short answer is that a histogram is a summary of data in terms of the frequency of ranges of values. So given the following data set [1,2,3,...,100000], captured over a second, I could summarize it in several ways:
  • I could capture the whole range of values, each value in a bucket of its own, assigning each value a frequency of 1 per second.
  • I could have a single bucket for values between 0 and 100,000, this bucket will have the frequency of 100,000 times per second.
These are both a bit silly: the first is as bad as dealing with the full set of data, the second tells us nothing about the way the 100,000 values break down within the range. Still, these are the 2 extremes of histograms; the alternatives lie between them in terms of the data they offer, but there are many ways to skin a cat (apparently, :( poor cats):
  • Capture the values in 1000 buckets, each bucket representing a range of 100 values: [1..100][101..200]...[99,901..100,000] that will result in 1,000 buckets each with a frequency of 100. This is the sort of histogram described above where all buckets capture the same fixed range.
  • Capture the values in 17 buckets, each bucket K representing a range [2^K..(2^(K+1)-1)] e.g. [1..1][2..3][4..7]...[65,536..131,071]. This would be a good solution if we thought most values are likely to be small and so wanted higher precision on the lower range, with lower and lower precision for larger values. Note that we don't have to use 2 as the base for exponential histogram, other values work as well.
Both of the above solutions trade precision against storage space across a large range. In both solutions I am required to choose the histogram precision up-front, and we expect to pay for a large range with either space or precision. Now that we realize what the variables are we can describe these solutions:
  • Linear buckets: For a range 0..R we will have to pay R/B space for buckets of range B. The higher R is the more space we require, we can compensate by picking a large B.
  • Exponential buckets: For a range 0..R we require space of log2 of R. The bucket size grows exponentially as we track higher values.
The problem we face with latency data points is that the range of values we want to capture is rather large. It is not unreasonable to expect the latency outliers to be several orders of magnitude larger than the typical observed measurement. For example, it may well be that we are timing a method whose cost is in the 100s of nanoseconds, or a high speed connection round trip on the order of 1-100µs but on occasion our latency is dominated by some stop the world GC pause, or network congestion, which is in the order of 10ms to a few seconds. How can we correctly size the range of our histogram? Given the possibility of multi-second GC delays we need to cover a typical range of 1000ns to 100,000,000,000ns. If we used a linear histogram with a 100µs bucket we'd need 1,000,000 buckets (assuming an int counter this will add up to a ~4MB data structure).
The HdrHistogram follows a different path to the above solutions and manages to accommodate a large range with a high precision across the range in a limited space.

How does it work?

Here's what the documentation has to say:
"Internally, data in HdrHistogram variants is maintained using a concept somewhat similar to that of floating point number representation: Using an exponent a (non-normalized) mantissa to support a wide dynamic range at a high but varying (by exponent value) resolution. AbstractHistogram uses exponentially increasing bucket value ranges (the parallel of the exponent portion of a floating point number) with each bucket containing a fixed number (per bucket) set of linear sub-buckets (the parallel of a non-normalized mantissa portion of a floating point number). Both dynamic range and resolution are configurable, with highestTrackableValue controlling dynamic range, and numberOfSignificantValueDigits controlling resolution."
Hmmm... I'll admit to having difficulty immediately understanding what's happening from the above text, precise though it may be. I had to step through the code to get my head around what works and why, read the above again and let it simmer. I'm not going to dig into the implementation because, while interesting, it is not the point of this post. I leave it to the reader to pester Gil Tene (the author of HdrHistogram) with implementation related questions.
The principal idea is a mix of the exponential and linear histograms: exponential buckets provide the large dynamic range, while linear sub-buckets within each exponential bucket keep the precision appropriate to the scale of the value. At the scale of seconds we have a precision of milliseconds, at the scale of milliseconds we have a precision of microseconds, and so on.
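You can see this through the HdrHistogram API itself: the width of the bucket a value falls into (its "equivalent value range") grows with the value while relative precision stays put. A small sketch, where the range and precision I pick are purely illustrative:
import org.HdrHistogram.Histogram;
import java.util.concurrent.TimeUnit;

public class ResolutionSketch {
  public static void main(String[] args) {
    // Track values from 1ns up to 1 hour, keeping 2 significant digits
    Histogram histogram = new Histogram(TimeUnit.HOURS.toNanos(1), 2);

    // The bucket width grows with the value, so absolute precision gets coarser
    // for big values while relative precision stays within the configured digits.
    System.out.println(histogram.sizeOfEquivalentValueRange(2_500L));         // ~2.5µs value
    System.out.println(histogram.sizeOfEquivalentValueRange(2_500_000L));     // ~2.5ms value
    System.out.println(histogram.sizeOfEquivalentValueRange(2_500_000_000L)); // ~2.5s value
  }
}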

Example: From raw recording to histogram

A while back I posted a Java ping utility which measures the round trip between minimal client and server processes. Each round trip was recorded into a large array, and every set number of round trips the measurements were summarized into percentiles.
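In outline, the raw approach looks something like this (a simplified sketch of my own, not the original utility's code):
import java.util.Arrays;

public class RawLatencyRecorder {
  private final long[] samples = new long[1_000_000]; // ~8MB of longs
  private int count;

  public void record(long rttNanos) {
    samples[count++] = rttNanos;
  }

  // Summarize by sorting and picking the relevant data points
  public void report() {
    if (count == 0) return;
    long[] sorted = Arrays.copyOf(samples, count);
    Arrays.sort(sorted);
    System.out.printf("50%%=%d 90%%=%d 99%%=%d 99.9%%=%d max=%d%n",
        percentile(sorted, 50.0), percentile(sorted, 90.0),
        percentile(sorted, 99.0), percentile(sorted, 99.9),
        sorted[sorted.length - 1]);
    count = 0;
  }

  private static long percentile(long[] sorted, double pct) {
    int i = (int) Math.ceil((pct / 100.0) * sorted.length) - 1;
    return sorted[Math.max(i, 0)];
  }
}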
Recording raw data is the simplest way of capturing latency, but it comes at a price. The long[] used to capture the latencies is ~8MB in size for a million samples, and in a real application it can grow without bounds until some cutoff point where we decide to summarize or discard the data. When we want to report percentiles we have to sort it and pick the relevant data points. This is not usually an acceptable solution (because of the memory footprint), but it offers absolute accuracy and is trivial to implement (until you have to consider serialization, concurrency and visualization, but otherwise trivial).
Replacing this measurement method with a histogram is straightforward.
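A sketch of the swap using the HdrHistogram API (the class, range and precision choices here are mine, for illustration):
import org.HdrHistogram.Histogram;
import java.util.concurrent.TimeUnit;

public class HistogramLatencyRecorder {
  // 1ns..1min round trips at 2 significant digits of precision
  private final Histogram histogram = new Histogram(TimeUnit.MINUTES.toNanos(1), 2);

  public void record(long rttNanos) {
    histogram.recordValue(rttNanos);
  }

  public void report() {
    System.out.printf("50%%=%d 90%%=%d 99%%=%d 99.9%%=%d max=%d%n",
        histogram.getValueAtPercentile(50.0), histogram.getValueAtPercentile(90.0),
        histogram.getValueAtPercentile(99.0), histogram.getValueAtPercentile(99.9),
        histogram.getMaxValue());
    histogram.reset();
  }
}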
This histogram is 31KB when using 2 significant digits of precision, which is good enough in most cases (measured with JOL; add it to your utility belt if it ain't there already. Increasing the precision to 3 digits increases the size to 220KB). That is a large improvement over 8MB. We could reduce the memory consumption further if we were willing to limit the maximum data point count per bucket and use an integer/short histogram (ints seem like a reasonable choice).
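If you want to check the footprint on your own configuration, a quick JOL sketch would look something like this (assuming the jol-core dependency is on the classpath; the histogram parameters are illustrative):
import org.HdrHistogram.Histogram;
import org.openjdk.jol.info.GraphLayout;
import java.util.concurrent.TimeUnit;

public class HistogramFootprint {
  public static void main(String[] args) {
    Histogram twoDigits = new Histogram(TimeUnit.MINUTES.toNanos(1), 2);
    Histogram threeDigits = new Histogram(TimeUnit.MINUTES.toNanos(1), 3);
    // Total retained size of each histogram, counts array included
    System.out.println("2 digits: " + GraphLayout.parseInstance(twoDigits).totalSize() + " bytes");
    System.out.println("3 digits: " + GraphLayout.parseInstance(threeDigits).totalSize() + " bytes");
  }
}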
If we print both measurement methods for the same run we can see the difference between the raw data and the HDR representation, which is naturally slightly less accurate (# lines are HdrHistogram, @ lines are raw data, each line represents 1M data points):
We can see that reported percentiles are pretty close to the raw data:

  • Note that System.nanoTime() on the Mac reports with µs granularity, which is why the real values (@ lines) all end with 3 zeros.
  • Note that the reported max/min are adjusted to the histogram's resolution (not a big deal, but slightly surprising).

What would have happened with our hand rolled solution? To cover a range of 0 to 60,000,000,000ns in a linear histogram of the same memory footprint we would need to limit ourselves to roughly 220KB/4B ≈ 55K int-counter buckets. Each bucket would then have a granularity of roughly 1ms, which would translate to very limited visibility on the lower end of the spectrum, as most data points are actually under 1ms. This would also completely skew our percentiles (i.e. 99.99% of the results land in the first 1ms bucket, with no breakdown of behaviour within it). We could try to tackle the issue by picking a smaller range to cover (which, if applied to the HdrHistogram, would minimize its memory usage further) or by blowing the memory budget on finer grained buckets.

Example: Aeron samples, much percentiles! Such graphs! Wow!

Percentiles are commonly used to describe the latency SLA of a given system, and a typical application will report a range of percentiles to reflect the probability of a given response time. In this context we can say that a 99%ile latency of 1ms means that 99% of all requests were handled in <= 1ms.
Aeron is a low latency, reliable UDP messaging library. Since latency is an obvious concern, all the samples utilize the HdrHistogram lib to demonstrate measurement and report results. The relevant bits of the Ping sample record each round trip time into a histogram and print the percentile distribution at the end of the run.
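The pattern looks roughly like this (my paraphrase of the approach, not the actual Aeron sample code; names and parameters are mine):
import org.HdrHistogram.Histogram;
import java.util.concurrent.TimeUnit;

public class PingLikeMeasurement {
  // Up to 10s RTT at 3 significant digits (illustrative parameters)
  static final Histogram HISTOGRAM = new Histogram(TimeUnit.SECONDS.toNanos(10), 3);

  // Called when the pong for a given ping comes back
  static void onPong(long sendTimeNanos) {
    HISTOGRAM.recordValue(System.nanoTime() - sendTimeNanos);
  }

  // Called once the run is over: dump the full percentile distribution,
  // scaling output from ns to µs (the 1000.0 ratio)
  static void dumpResults() {
    HISTOGRAM.outputPercentileDistribution(System.out, 1000.0);
  }
}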
This results in a pile of text getting poured into the console, not that friendly:
Fear not, HdrHistogram comes packed with a handy charty thingy! Here's what the above histogram looks like when plotted:

To get this graph:
  • Save output above to a text file
  • Open a browser, and go here (the same HTML is in the project here)
  • Choose your file, choose percentiles to report and unit to report in
  • Export the picture and stick it in your blog!
This histogram was provided to me by Martin Thompson (one of the Aeron developers) and is from a test run in a performance lab. We can see that Aeron is delivering a solid 7µs RTT up to the 90%ile, where latency starts to gradually grow. In this particular data set the maximum observed latency was 38µs. This is a great latency profile. It is far more common for the max and 99.99%ile to be orders of magnitude more than the 90%ile.
I could similarly plot this histogram using a gnuplot script to be found here. The gnuplot script is very handy for scripted reporting on large runs. It also allows for plotting several histograms on the same graph to allow visual comparison between benchmark runs for instance.

Example: Compressed histogram logging

Because the SLA is often specified in percentiles, it is common for applications to log only percentiles and not full histograms. This leads to a reporting problem, as it turns out that percentile outputs cannot be combined to produce meaningful aggregate percentiles. The solution would be to log the full histogram data, but who wants a log that grows by 31KB every 5 seconds just to capture one histogram? Worry not, HdrHistogram comes with a compressed logging format, a log writer and all that good stuff.
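A minimal sketch of interval logging with the HistogramLogWriter (the file name, interval length and histogram parameters are my own choices):
import org.HdrHistogram.Histogram;
import org.HdrHistogram.HistogramLogWriter;
import java.io.FileNotFoundException;
import java.util.concurrent.TimeUnit;

public class CompressedLoggingSketch {
  public static void main(String[] args) throws FileNotFoundException, InterruptedException {
    HistogramLogWriter logWriter = new HistogramLogWriter("latency.hlog");
    logWriter.outputLogFormatVersion();
    logWriter.outputStartTime(System.currentTimeMillis());
    logWriter.outputLegend();

    Histogram intervalHistogram = new Histogram(TimeUnit.SECONDS.toNanos(10), 2);
    for (int interval = 0; interval < 3; interval++) {
      intervalHistogram.setStartTimeStamp(System.currentTimeMillis());
      recordForAWhile(intervalHistogram);                   // ~5s worth of measurements
      intervalHistogram.setEndTimeStamp(System.currentTimeMillis());
      logWriter.outputIntervalHistogram(intervalHistogram); // one compressed line per interval
      intervalHistogram.reset();
    }
  }

  // Stand-in for real measurement code
  static void recordForAWhile(Histogram h) throws InterruptedException {
    for (int i = 0; i < 1000; i++) {
      h.recordValue((long) (Math.random() * 1_000_000));
    }
    Thread.sleep(5000);
  }
}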
How many bytes is that logged histogram costing us? The 31KB histogram compressed down to 1KB in my example, but the result of compression will depend on the histogram captured. It is fair to assume that histograms compress well, since the array of buckets is full of zeros (on the byte level) because most buckets are empty or hold low counts.
If 1KB sounds like a lot, consider that a day's worth of 10s interval histograms will result in an 8MB log file, which seems pretty acceptable even if you have a hundred such files. The benefit is that you now have high precision interval latency data that you can reliably combine into latency data for longer intervals. You can use the HistogramLogProcessor to produce a full or partial log summary histogram for plotting as above.
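The same can be done programmatically by reading the intervals back and adding them into a single summary histogram; a sketch, assuming a log written as in the previous example:
import org.HdrHistogram.Histogram;
import org.HdrHistogram.HistogramLogReader;
import java.io.FileNotFoundException;
import java.util.concurrent.TimeUnit;

public class LogSummarySketch {
  public static void main(String[] args) throws FileNotFoundException {
    HistogramLogReader reader = new HistogramLogReader("latency.hlog");
    Histogram summary = new Histogram(TimeUnit.SECONDS.toNanos(10), 2);

    // nextIntervalHistogram() returns null once the log is exhausted
    Histogram interval;
    while ((interval = (Histogram) reader.nextIntervalHistogram()) != null) {
      summary.add(interval); // interval histograms combine losslessly, unlike percentiles
    }
    summary.outputPercentileDistribution(System.out, 1000.0);
  }
}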
I believe there are some truly exciting data visualizations one could build on top of this data, but sadly that's not where my skills lie. If you've got skills to show off in this area I'm sure HdrHistogram would value your contribution.

Example: jHiccup and concurrent logging

jHiccup is a pause measurement tool used to capture OS or JVM 'hiccups'. It deserves its own post, but I'll try to summarize it in a few points:
  • jHiccup runs a HiccupRecorder thread which sleeps for a (configurable) period of time and measures the delta between the expected wakeup time and the actual time. The failure to be re-scheduled promptly is taken as a potential OS/JVM hiccup. The size of the hiccup is recorded in a histogram.
  • jHiccup can be run as an agent in your own process, an external process, or both.
  • jHiccup has been ported to C as well.
  • People typically use jHiccup to help characterize and diagnose disruptions to execution in their system. While not every hiccup is the result of a STW pause, we can correlate evidence from the jHiccup agent with an external jHiccup process and the JVM GC logs to support root cause analysis. A significant hiccup is a serious sign of trouble: a thread was denied scheduling for the length of the hiccup, and we can safely assume in most cases that other threads were similarly disturbed.
Gil Tene originally wrote HdrHistogram as part of jHiccup, but as HdrHistogram turned out to be more generally useful the two were split. The reason I bring jHiccup up in this context is that it serves as a regularly maintained full blown real world example of using an HdrHistogram.
jHiccup has 2 interesting threads, with roles that parallel many real world applications out there:
  • The measuring thread (HiccupRecorder): this is the thread that sleeps, wakes up and measures, over and over. The rate at which it does that is potentially quite high, and we don't want to skew the measurement by performing IO on this thread. Similarly, many real world applications have critical threads on which it is not desirable to perform IO, so the actual persistence is pushed to another thread.
  • The monitoring/logging thread (HiccupMeter): this thread wakes up at regular intervals and writes the last interval's histogram to the log file. But since it is reading a histogram while another thread is writing to it, we now need to manage concurrency.
HdrHistogram offers a synchronization facility to serve exactly this use case in the form of the Recorder:
  • Recording a value in the recorder is a wait-free operation on JDK8 (it can be lock-free on older JDKs, depending on the getAndAdd implementation for AtomicLongArray).
  • The Recorder also comes in a single-writer flavour, which minimizes the concurrency related overheads.

Under the covers the Recorder uses an active histogram and an inactive one, swapped seamlessly when an interval histogram is requested. Using a Recorder looks much like using a normal histogram (full code here).
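Here is a sketch of the two-thread pattern described above (this is not the jHiccup code itself; the class, thread and method names are mine):
import org.HdrHistogram.Histogram;
import org.HdrHistogram.HistogramLogWriter;
import org.HdrHistogram.Recorder;
import java.io.FileNotFoundException;
import java.util.concurrent.TimeUnit;

public class RecorderSketch {
  // Safe to share between the two threads: recording is wait-free on JDK8
  static final Recorder recorder = new Recorder(TimeUnit.SECONDS.toNanos(10), 2);

  public static void main(String[] args) throws FileNotFoundException {
    // The measuring thread: records, never does IO
    Thread measuring = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        long start = System.nanoTime();
        doTheThingWeAreTiming();
        recorder.recordValue(System.nanoTime() - start);
      }
    });

    // The logging thread: drains an interval histogram and writes it to the log
    HistogramLogWriter logWriter = new HistogramLogWriter("latency.hlog");
    Thread logging = new Thread(() -> {
      Histogram intervalHistogram = null;
      while (!Thread.currentThread().isInterrupted()) {
        sleepQuietly(5000);
        // Swaps the active/inactive histograms and hands back the last interval's data;
        // passing the previous interval histogram in allows it to be recycled
        intervalHistogram = recorder.getIntervalHistogram(intervalHistogram);
        logWriter.outputIntervalHistogram(intervalHistogram);
      }
    });

    measuring.start();
    logging.start();
  }

  static void doTheThingWeAreTiming() { /* the code being measured */ }

  static void sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}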
And that's concurrent logging sorted ;-).

Summary

With HdrHistogram now hitting version 2.1.4 and offering a wealth of tried and tested functionality, along with cross platform implementations and a standardized compressed logging format, it is definitely time you gave it a go! May your latencies always be low!
If you are looking for a pet project and have a gift for UI thingies, a latency explorer for the histogram interval logs would be an awesome contribution!