Pages

Monday, 16 February 2015

HdrHistogram: A better latency capture method

Some years back I was working on a latency sensitive application, and since latency was sensitive it was a requirement that we somehow capture latency both on a transaction/event level and in summary form. The event level latency was post processed from the audit logs we had to produce in any case, but the summary form was used for live system monitoring. We ended up with a solution I've seen since in many places (mild variations on a theme), which is what I've come to think of as the linear buckets histogram:

The above data structure was an easy solution to a problem we had little time to solve, but it left much to be desired.
These days the histogram problem is gloriously solved by the HdrHistogram (High Dynamic Range), and though it's been around for a couple of years now I still find olden hand rolled histograms in many a code base. Cut that shit out boys and girls! Let me show you a better way :-)

HdrHistogram highlights:
  • Mature and battle tested, this data structure has been in production for many companies for a while now. Kinks have been unkinked and knickers untwisted.
  • Multi-lingual support: current implementations available in Java, C, C#, Erlang, Go and more are on the way.
  • Auto-resizing histograms (if you exceed your initial miximum value estimate)
  • Compact memory footprint supporting high precision of values across a wide range.
  • Compressed lossless serialization/de-serialization
  • Plotting scripts for gnuplot, a webby charting tool and an excel sheet chart
  • Lock-free concurrency support for recording and logging from multiple threads
  • Zero allocation on recording path (unless resizing which is optional, and then only if value exceeds initially specified max recorded value)
  • Constant time measurement which is less than cost of calling System.nanoTime() (on the cost, scalability and trustworthiness of nanoTime read Shipilev's excellent report)
It is truly as if the bees-knees and the dogs-bollocks had a baby, is it not?

Mama, what's a histogram?

"Histogram of arrivals per minute" by DanielPenfield
Own work. Licensed under CC BY-SA 3.0 via Wikimedia Common
Well, you could look it up on wikipedia, but the short answer is that a histogram is a summary of data in terms of the frequency of ranges of values.  So given the following data set [1,2,3,...,100000] captured for a second I could summarize in several ways:
  • I could capture the whole range of values each value in a bucket of it's own, assigning each value a frequency of 1 per second.
  • I could have a single bucket for values between 0 and 100,000, this bucket will have the frequency of 100,000 times per second.
These are both a bit silly, the first is as bad as dealing with the full set of data, the second is telling us nothing about the way the 100,000 values break down within the range. Still these are the 2 extremes of histograms, the alternatives lie within that range in terms of the data they offer but there are many ways to skin a cat (apparently, :( poor cats):
  • Capture the values in 1000 buckets, each bucket representing a range of 100 values: [1..100][101..200]...[99,901..100,000] that will result in 1,000 buckets each with a frequency of 100. This is the sort of histogram described above where all buckets capture the same fixed range.
  • Capture the values in 17 buckets, each bucket K representing a range [2^K..(2^(K+1)-1)] e.g. [1..1][2..3][4..7]...[65,536..131,071]. This would be a good solution if we thought most values are likely to be small and so wanted higher precision on the lower range, with lower and lower precision for larger values. Note that we don't have to use 2 as the base for exponential histogram, other values work as well.
Both of the above solutions trade precision across a large range with storage space. In both solutions I am required to choose up-front the histogram precision and we expect to pay for a large range with either space or precision. Now that we realize what the variables are we can describe these solutions:
  • Linear buckets: For a range 0..R we will have to pay R/B space for buckets of range B. The higher R is the more space we require, we can compensate by picking a large B.
  • Exponential buckets: For a range 0..R we require space of log2 of R. The bucket size grows exponentially as we track higher values.
The problem we face with latency data points is that the range of values we want to capture is rather large. It is not unreasonable to expect the latency outliers to be several orders of magnitude larger than the typical observed measurement. For example, it may well be that we are timing a method whose cost is in the 100s of nanoseconds, or a high speed connection round trip on the order of 1-100µs but on occasion our latency is dominated by some stop the world GC pause, or network congestion, which is in the order of 10ms to a few seconds. How can we correctly size the range of our histogram? Given the possibility of multi-second GC delays we need to cover a typical range of 1000ns to 100,000,000,000ns. If we used a linear histogram with a 100µs bucket we'd need 1,000,000 buckets (assuming an int counter this will add up to a ~4MB data structure).
The HdrHistogram follows a different path to the above solutions and manages to accommodate a large range with a high precision across the range in a limited space.

How does it work?

Here's what the documentation has to say:
"Internally, data in HdrHistogram variants is maintained using a concept somewhat similar to that of floating point number representation: Using an exponent a (non-normalized) mantissa to support a wide dynamic range at a high but varying (by exponent value) resolution. AbstractHistogram uses exponentially increasing bucket value ranges (the parallel of the exponent portion of a floating point number) with each bucket containing a fixed number (per bucket) set of linear sub-buckets (the parallel of a non-normalized mantissa portion of a floating point number). Both dynamic range and resolution are configurable, with highestTrackableValue controlling dynamic range, and numberOfSignificantValueDigits controlling resolution."
Hmmm... I'll admit to having difficulty immediately understanding what's happening from the above text, precise though it may be. I had to step through the code to get my head around what works why, read the above again and let it simmer. I'm not going to dig into the implementation because, while interesting, it is not the point of this post. I leave it to the reader to pester Gil Tene(the author of HdrHistogram) with implementation related questions.
The principal idea is a mix of the exponential and linear histograms to support a dynamic range precision that is appropriate to the time unit scale. At the scale of seconds we have a precision of milliseconds, at the scale of milliseconds we have a precision of microseconds etc. This translates roughly into exponential scale buckets which have linear sub-buckets.

Example: From raw recording to histogram

I have posted a while back a Java ping utility which measures the round trip between minimal client and server processes. Each round trip was recorded into a large array and every set number of round trips the measurements were summarized in percentiles:
Recording raw data is the simplest way of capturing latency, but it comes at a price. The long[] used to capture the latencies is ~8MB in size, this is for a million samples and in a real application can grow without bounds until some cutoff point where we decide to summarize or discard the data. When we want to report percentiles we have to sort it and pick the relevant data points. This is not usually an acceptable solution (because of the memory footprint), but it offers absolute accuracy and is trivial to implement (until you have to consider serialization, concurrency and visualization, but otherwise trivial).
Replacing this measurement method with a histogram is straight forward:
This histogram is 31KB when using 2 decimal places precision which is good enough in most case (according to JOL, add it to your utility belt if it ain't there already. Increasing the precision to 3 increases the size to 220KB), which is a large improvement over 8MB. We could reduce the memory consumption further if we were willing to limit the maximum data point count per bucket and use an integer/short histogram(ints seem like a reasonable choice).
If we print both measurements methods for the same run we can see the difference between the raw data and the HDR representation which is naturally slightly less accurate (# lines are HdrHistogram,@ is for raw data, each line represents 1M data points):
We can see that reported percentiles are pretty close to the raw data:

  • Note that the nanoTime on Mac reports in µs granularity, which is why the real values(@ lines) all end with 3 zeros.
  • Note that the max/min reported are adjusted to the correct histogram resolution (not a big deal, but slightly surprising).

What would have happened with our hand rolled solution? To keep a range of 0 to 60,000,000,000ns in a linear histogram of the same memory footprint we would need to limit ourselves to roughly 220k/8=64k buckets. Each bucket would have a granularity of roughly 1ms which would have translated to a very limited visibility on the lower end of the spectrum as most data sets are actually all under 1ms. This would have also completely skewed our percentiles (i.e 99.99% results in 1ms bucket, no breakdown of behaviour in percentiles). We could try and tackle the issue by picking a lower range to cover (which if applied to HdrHistogram will minimize memory usage further) or by blowing the memory budget on finer grained buckets.

Example: Aeron samples, much percentiles! Such graphs! Wow!

Percentiles are commonly used to describe latency SLA of a given system and a typical application will report a range of percentiles to reflect the probability of a given response time. In this context we can say that a 99%ile latency of 1ms means that 99% of all requests were handled in =< 1ms.
Aeron is a low latency, reliable UDP messaging library. Since latency is an obvious concern, all the samples utilize the HdrHistogram lib to demonstrate measurement and report results, here are the relevant excerpts from the Ping sample:
This results in a pile of text getting poured into the console, not that friendly:
Fear not, HdrHistogram comes packed with a handy charty thingy! Here's what the above histogram looks like when plotted:

To get this graph:
  • Save output above to a text file
  • Open a browser, and go here (the same HTML is in the project here)
  • Choose your file, choose percentiles to report and unit to report in
  • Export the picture and stick it in your blog!
This histogram was provided to me by Martin Thompson (one of the Aeron developers) and is from a test run in a performance lab. We can see that Aeron is delivering a solid 7µs RTT up to the 90%ile where latency starts to gradually grow. In this particular data set the maximum observed latency was 38µs. This is a great latency profile. It is far more common for the max and 99.99%ile to be orders of magnitude more that the 90%ile.
I could similarly plot this histogram using a gnuplot script to be found here. The gnuplot script is very handy for scripted reporting on large runs. It also allows for plotting several histograms on the same graph to allow visual comparison between benchmark runs for instance.

Example: Compressed histogram logging

Because the SLA is often specified in percentiles, it is common for applications to log only percentiles and not histograms. This leads to a reporting problem as it turns out that percentiles output cannot be combined to produce meaningful average percentiles. The solution would be to log the full histogram data, but who want a log that grows by 31KB every 5 seconds just to capture one histogram? Worry not, HdrHistogram comes with a compressed logging format and log writer and all that good stuff:
How many bytes is that logged histogram costing us? The 31KB histogram compressed down to 1KB in my example, but the result of compression will depend on the histogram captured. It is fair to assume that histograms compress well as the array of buckets is full of zeros (on the byte level) as most buckets are empty or low count.
If 1KB sounds like allot consider that a days worth of 10s interval histograms will result in an 8MB log file, which seems pretty acceptable even if you have a hundred such files. The benefit is that you will now have high precision interval latency data that you can reliably use to create longer interval latency data. You can use the HistogramLogProcessor to produce a full or partial log summary histogram for plotting as above.
I believe there's some truly exciting data visualizations one could build on top of this data, but sadly thats not where my skills lie. If you got skills to show of in this area I'm sure HdrHistogram would value your contribution.

Example: jHiccup and concurrent logging

jHiccup is a pause measurement tool used to capture OS or JVM 'hiccups'. It deserves it's own post but I'll try and summarize it in a few points:
  • jHiccup runs a HiccupRecorder thread which sleeps for a period of time (configurable) and measures the delta between the wakeup time and actual time. The failure to be re-scheduled is taken as a potential OS/JVM hiccup. The size of the hiccup is recorded in a histogram.
  • jHiccup can be run as an agent in your own process, an external process, or both.
  • jHiccup has been ported to C as well.
  • People typically use jHiccup to help charecterize and diagnose disruptions to execution in their system. While not every hiccup is the result of a STW pause we can use the jHiccup agent evidence correlated with an external jHiccup process and the JVM gc logs to support root cause analysis. A significant hiccup is a serious sign of trouble meaning a thread was denied from scheduling for the length of the hiccup. We can safely assume in most cases that this is a sign that other threads were similarly disturbed.
Gil Tene originally wrote HdrHistogram as part of jHiccup, but as HdrHistogram turned out to be more generally useful the two were split. The reason I bring jHiccup up in this context is that it serves as a regularly maintained full blown real world example of using an HdrHistogram.
jHiccup has 2 interesting threads, with roles that parallel many real world applications out there:
  • The measuring thread(HiccupRecorder): This is the thread that sleeps and wakes up and so on. The rate at which it does that is potentially quite high and we don't want to skew to measurement by performing IO on this thread. Similarly many real world application will have critical threads where it is not desirable to introduce IO. Since this is the case actual persistence will be performed on another thread.
  • The monitoring/logging thread(HiccupMeter): This thread will wake up at regular intervals and write the last interval's histogram to the log file. But since it is reading a histogram while another thread is writing to the histogram we now need to manage concurrency.
HdrHistogram offers a synchronization facility to serve exactly this use case in the form of the Recorder:
  • Recording a value in the recorder is a wait free operation (on JDK8, can be lock free on older depending on the getAndAdd implementation for AtomicLongArray).
  • The Recorder also comes in a single-writer flavour, which minimizes the concurrency related overheads.

Under the covers the recorder is using an active histogram and an inactive one, swapped seamlessly when an interval histogram is requested. Using a Recorder looks much like using a normal histogram(full code here):
And that's concurrent logging sorted ;-).

Summary

With HdrHistogram now hitting version 2.1.4 and offering a wealth of tried and tested functionality along with cross platform implementations and a standardized compressed logging format it is definitely time you gave it a go! May your latencies always be low!
If you are looking for a pet project and have a gift for UI thingies a latency explorer for the histogram interval logs would be an awesome contribution!