Friday, 21 October 2016

4 Years Blog Anniversary

This has been a very slow year blogging wise... too much work, travel, and family. 4 years ago I aimed for 2 posts a month. I nearly hit it in 2013, failed by a little in 2014, and went on to not meet it at all in the last 2 years...
Coincidentally, I also have a delightfully busy 2.5 year old running around; I suspect she has something to do with how little time I have. My eldest has developed a terrible addiction to reading so I hardly see her these days, so she can't be blamed for this one.
Since I started the blog I've also embarked on developing and maintaining JCTools, speaking at conferences, and starting a local JUG in Cape Town. Sharing of one kind breeds another, but you don't seem to gain bonus hours in the day. Hmmm.
If I sound like I'm making excuses it's because I'm still feeling bad about missing that 2 posts a month goal.
Despite my decreased output, blog traffic has steadily increased, teaching me a thing or 2 about search and web economics. I hope to write more in the next year as I plan to give conferences a rest for the next 6-8 months; we'll see how it goes.
In any case, thanks for reading :-)

Monday, 18 July 2016

Fixing Coordinated Omission in Cassandra Stress

I have it from reliable sources that incorrectly measuring latency can lead to losing one's job, loved ones, will to live and control of bowel movements. Out of my great love for fellow humans I have fixed the YCSB load generator to avoid the grave danger that is Coordinated Omission. This was met with great joy, and I have now gone on to fix up Cassandra Stress in a similar vein. The fix is now merged into trunk, so expect to have it with the next Cassandra release (or just build it from source NOW).
Before we start on the how and what, let us take a step back and consider the why:
  • Coordinated Omission is a term coined by my esteemed colleague Gil Tene to describe a common measurement anti-pattern where the measuring system inadvertently coordinates with the system under measurement (watch Gil's How NOT to Measure Latency talk). This results in greatly skewed measurements in commonly used load generators, and has led to long discussions on the Mechanical Sympathy mailing list and elsewhere. I've given my own explanation in a previous post, so go and have a read if you need a refresher on the subject.
  • Cassandra Stress (CS) is a tool which comes bundled with Cassandra to enable load testing of a given cluster. It allows the user to specify their own queries and schemas and is the preferred tool for Cassandra benchmarking as it gives better access to Cassandra features etc. CS allows 2 testing modes: throughput (default) or 'limit', where a given number of operations per second are thrown at the cluster (the fix discussed here does away with limit and replaces it with throttle/fixed, read on).
So coordinated omission is bad, and it is now fixed in Cassandra Stress. This post talks a bit on motivation (part I), a bit on implementation (part II) and a bit on what you can do with the new features (part III). Feel free to skip any and all parts, god knows those selfies don't take themselves.

PART I: Demonstrating The Issue

CS default mode is to hit the system under test as hard as it can. This is a common strategy in load generators and can give system writers an interesting edge case to work with. I ran the benchmark on my laptop (no attempt at finding real performance numbers here, I just care about the measurement issue); with the provided example workload I can saturate my Cassandra server (I only gave it a single core to run on) pretty easily. CS tells me the following about my miserable little server:
INFO  15:02:46 Results:
INFO  15:02:46 Op rate                   :    5,055 op/s  [insert: 500 op/s, simple1: 4,556 op/s]
INFO  15:02:46 Partition rate            :   17,294 pk/s  [insert: 12,739 pk/s, simple1: 4,556 pk/s]
INFO  15:02:46 Row rate                  :  150,266 row/s [insert: 12,739 row/s, simple1: 137,527 row/s]
INFO  15:02:46 Latency mean              :    4.5 ms [insert: 7.5 ms, simple1: 4.2 ms]
INFO  15:02:46 Latency median            :    3.0 ms [insert: 5.4 ms, simple1: 2.8 ms]
INFO  15:02:46 Latency 95th percentile   :   13.1 ms [insert: 20.1 ms, simple1: 11.9 ms]
INFO  15:02:46 Latency 99th percentile   :   23.8 ms [insert: 33.7 ms, simple1: 21.8 ms]
INFO  15:02:46 Latency 99.9th percentile :   49.8 ms [insert: 55.0 ms, simple1: 49.0 ms]
INFO  15:02:46 Latency max               :  603.5 ms [insert: 598.7 ms, simple1: 603.5 ms]
INFO  15:02:46 Total partitions          :  1,000,000 [insert: 736,585, simple1: 263,415]
INFO  15:02:46 Total errors              :          0 [insert: 0, simple1: 0]
INFO  15:02:46 Total GC count            : 112
INFO  15:02:46 Total GC memory           : 34.850 GiB
INFO  15:02:46 Total GC time             :    1.0 seconds
INFO  15:02:46 Avg GC time               :    9.4 ms
INFO  15:02:46 StdDev GC time            :   16.6 ms
INFO  15:02:46 Total operation time      : 00:00:57
With the other 3 cores on my laptop hitting it as hard as they could the median latency on this maxed out server was 3ms. That is pretty awesome. But also, it makes no sense.
How can a maxed out server have a typical response time of 3ms? In reality, when servers are maxed out they are unresponsive; the typical response time becomes worse as the load increases. What CS is reporting, however, is not response time. It is 'latency'. Latency is one of those terms people use to mean many things, and in this case in particular it does not mean "response time" but rather "service time". Here's a definition of more specific terms to describe system responsiveness (see the wiki entry on response time):
  • Response time: The time between the submission of a request and the completion of the response.
  • Service time: The time between the initiation and completion of the response to a request.
  • Wait time: The time between the submission of the request and initiation of the response.
In an all out test one could argue we want all the results as soon as possible, and given a magic load generator they would all launch instantaneously at the benchmark start. This would mean the submission time for all requests is at the beginning of the run. Naturally the server will not be able to respond instantly to all requests and can only allow so many requests to be handled in parallel. If the max throughput is 5,000 ops/sec, and we are measuring for 100,000 ops, it would mean 95K ops have waited at least a second, so their response time would be at least 1 second (response time > wait time). By the end of the run we would have 5K ops which have patiently waited at least 19 seconds (so the 99%ile should be at least 19 seconds).
It follows that in an all out throughput benchmark response time is terrible by definition, and completely uninformative. It also follows that we should not expect the 'latency' above to be at all indicative of the sort of response time we would get from this server.
The alternative to an all out throughput benchmark is a Responsiveness Under Load (RUL) benchmark. Using Cassandra Stress one can (or rather, one could before this fix went in) use the '-rate limit=17000/s' option to benchmark under a load of 17k pks/sec (pks = partition keys; each operation costs X keys, so the throughput limit is specified in pks, not ops). Running this gives me a warm fuzzy feeling: now for sure I shall get a glimpse of the response time at max throughput:
INFO  08:03:54 Results:
INFO  08:03:54 Op rate                   :    3,712 op/s  [insert: 369 op/s, simple1: 3,343 op/s]
INFO  08:03:54 Partition rate            :   12,795 pk/s  [insert: 9,452 pk/s, simple1: 3,343 pk/s]
INFO  08:03:54 Row rate                  :  110,365 row/s [insert: 9,452 row/s, simple1: 100,913 row/s]
INFO  08:03:54 Latency mean              :    1.0 ms [insert: 1.6 ms, simple1: 0.9 ms]
INFO  08:03:54 Latency median            :    0.7 ms [insert: 1.3 ms, simple1: 0.7 ms]
INFO  08:03:54 Latency 95th percentile   :    2.2 ms [insert: 3.4 ms, simple1: 2.0 ms]
INFO  08:03:54 Latency 99th percentile   :    4.6 ms [insert: 7.4 ms, simple1: 4.1 ms]
INFO  08:03:54 Latency 99.9th percentile :   13.4 ms [insert: 23.8 ms, simple1: 12.1 ms]
INFO  08:03:54 Latency max               :   63.9 ms [insert: 59.9 ms, simple1: 63.9 ms]
INFO  08:03:54 Total partitions          :    300,000 [insert: 221,621, simple1: 78,379]
INFO  08:03:54 Total errors              :          0 [insert: 0, simple1: 0]
INFO  08:03:54 Total GC count            : 33
INFO  08:03:54 Total GC memory           : 10.270 GiB
INFO  08:03:54 Total GC time             :    0.2 seconds
INFO  08:03:54 Avg GC time               :    7.5 ms
INFO  08:03:54 StdDev GC time            :    2.5 ms
INFO  08:03:54 Total operation time      : 00:00:23
This seems nice, and if I were not a suspicious man I might accept it. The thing is, I asked for 17k pks per second, but I only got 12,795 pk/s, so obviously the server could not meet the implied schedule. If it could not meet the schedule, response time should be terrible. But it's not; it's very similar to the result we got above. Because? Because again, latency here means service time and not response time. While response time is not informative in an all out test, in an RUL benchmark it is the whole purpose of the benchmark. I have a schedule in mind, requests come at a particular rate, which implies they have a known start time (request n will start at: T0 + n/rate, T0 being the start of the test). This is coordinated omission, let's fix it.

Part II(a): It Puts The Measurements In The HdrHistogram or It Gets The Hose Again!

First off, I like to have HdrHistogram files to work with when looking at latency data (made a whole post about it too). They are usually tons better than whatever it is people hand roll, and they come with a bunch of tooling that makes data post processing easy. Importantly, HdrHistograms are lossless, compact (with configurable precision), and support lossless compressed logging. Combine that with the high resolution of data and you have a great basis for post run analysis.
Cassandra Stress had its own sampling latency collection mechanism which would randomly drop some samples as a means to improve memory footprint, so the replacement improved reporting accuracy and reduced the amount of code. A side effect of this change is that Cassandra perf regression testing stability has improved rather dramatically, as indicated by this graph:

Because of the random sampling 99.9%ile reporting was unstable before the patch went in (May 28th), but smoooooth ever since. Ain't that nice?
Here's the commit with the main changes on the branch that became the PR; there were further changes introduced to enable histogram logging to files. The change is mostly tedious, but importantly it introduces the notion of split response/service/wait histograms. The data is recorded as follows:
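In rough outline it boils down to something like this (a sketch with my own names, not the exact patch):

import org.HdrHistogram.Histogram;

// Sketch: the three latency components are recorded into separate histograms,
// given the operation's intended start time (dictated by the schedule), its
// actual start time and its end time, all in nanoseconds.
class OpMeasurement {
    final Histogram serviceTime  = new Histogram(3);
    final Histogram waitTime     = new Histogram(3);
    final Histogram responseTime = new Histogram(3);

    void record(long intendedNs, long startNs, long endNs) {
        waitTime.recordValue(startNs - intendedNs);   // how far behind schedule we started
        serviceTime.recordValue(endNs - startNs);     // the old 'latency'
        responseTime.recordValue(endNs - intendedNs); // what a client keeping to the schedule experiences
    }
}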
The question is, what is the intended start time?

Part II(b): Where Do I Start?

Load generators, when not going all out, generally have some notion of schedule. It is more often than not quite simply a fixed rate of requests, though the notion of schedule holds regardless of how you come up with it. A schedule in this context means that for any event X there is a time when it should happen: st(X). That time is easily computed in a fixed rate schedule as: "st(n) = T0 + n/rate". Cassandra Stress however was using Google's RateLimiter to provide it with the scheduling, and while battle tested and handy it does not have a notion of schedule. The replacement took place in 2 steps.
First I refactored the existing code to hide the details of how operations are scheduled and where they come from behind a blocking-queue-like interface. The next step was to support a fixed rate stream of operations where the intended schedule is available so I can use it. This is what I ended up with (further tweaked to only start the clock when all the consumer initialization has happened):
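In spirit it amounts to something like the following sketch (my own simplified names, not the actual class):

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

// A simplified sketch: operations are handed out at a fixed rate and each op
// carries its intended start time, st(n) = T0 + n/rate. The stream never runs
// ahead of the schedule, but the intended time it hands out may already be in
// the past if the system under test has fallen behind.
class FixedRateOpStream {
    private final long startNs;     // T0, captured after consumer initialization
    private final long intervalNs;  // 1/rate, in nanoseconds
    private final AtomicLong opIndex = new AtomicLong();

    FixedRateOpStream(long partitionsPerSecond, long startNs) {
        this.intervalNs = 1_000_000_000L / partitionsPerSecond;
        this.startNs = startNs;
    }

    /** Blocks until the next operation is due and returns its intended start time. */
    long nextIntendedTimeNs() {
        long intended = startNs + opIndex.getAndIncrement() * intervalNs;
        long delay;
        while ((delay = intended - System.nanoTime()) > 0) {
            LockSupport.parkNanos(delay); // don't run ahead of the schedule
        }
        return intended; // may be in the past if we have fallen behind
    }
}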

Now we're all set to benchmark with no Coordinated Omission!

Part II(c): Who Reports What?

With the measurement collection now all set up, we face the question of what the different load generating modes should report. Since it turned out that 'limit' was a confusing name (hard limit? upper limit?) it was decided to replace it with 2 distinct modes, adding up to a total of 3 modes:
  • Throughput: latency == service time, response/wait time not recorded. Maximum throughput is an important test mode for flushing out CPU bottlenecks and contention. It may help in exposing IO configuration issues as well. It is not however a good predictor of response time distribution, as no attempt is made to estimate independent request timing (i.e. uncoordinated measurement). The maximum throughput number is valuable for batch processing etc., but I'd caution against using it for the sizing of a responsive system. If a system has responsiveness SLAs it is better to use the 'fixed' test mode and run for significant periods to determine the load under which the response time SLA can be met.
  • Throttle ("-rate throttle=1000/s"): latency == service time, response/wait time recorded as in fixed. This mode is a compromise for people who liked the old 'limit' measurement and want to measure "service time under attempted load". The partitions/second target is attempted, but the summary does not reflect the response time. If you like, this is a way to sneakily record the response time while keeping the summary as it is so people are not so startled by the difference. I don't see myself using it, but have a blast.
  • Fixed ("-rate fixed=1000/s"): latency == response time, response/service/wait time recorded. This mode is for benchmarking response time under load. The partitions/second parameter is used to determine the fixed rate schedule of the requests. The hdr log can be used to visualize latencies over time or aggregate latencies for any segment down to the logging frequency. The logs contain response/wait/service components that can be extracted and handled separately.
In all of the above you can choose to record the HDR histograms to a file on a fixed interval. The interval is the one used for summary reporting. To enable histogram logging use: " -log interval=10s hdrfile=cs.hdr"
Note that the interval should be quite long; the default of 1s is not actually achievable by Cassandra Stress at the moment. This is down to the terrible way the reporting thread synchronizes with the load generation threads, and while it is on my wish list ("I wish I had time to fix it...") it was outside the scope of fixing CO, so it lives to taunt us another day. I've settled on 10 second intervals for now; you can have even longer ones, it all depends on the sort of granularity you want in your reports.

Part III: What Is It Good For?

So we've got 3 load generating modes, and we've broken up latency into 3 components; the numbers are in histograms, the histograms are in the logs... let's have a little worked example.
I run a Cassandra 2.1 node on my laptop (old software, bad hardware... I don't care. This is not about absolute numbers, it's about the measurement features). To run this please:
  1. Clone the Cassandra trunk: git clone
  2. Build Cassandra and Cassandra Stress: ant clean build stress-build jar

I use the example workload provided by the good Cassandra folks, and start with an all out stress test from cassandra/tools/bin:
$ ./cassandra-stress user profile=../cqlstress-example.yaml duration=60s ops\(insert=1,simple1=9\) -mode native cql3 protocolVersion=2 -rate threads=3  -log interval=10s hdrfile=throughput.hdr
Joy, my laptop is giving me an awesome throughput of 13,803 pk/s. The summary itself is pretty informative for throughput runs, so what do we win with the HDR log?
The log we produce can be summarized, manipulated and graphed by a bunch of utilities. Here's what the log looks like (this is from the fixed mode run):

Note that while we logged our latency in nanoseconds, the max column is in milliseconds. The nanosecond level measurements are still available in the compressed histogram to the right. Sadly it's not very friendly on the eye. HdrHistogram does include a log processing utility, but it offers quite basic facilities. I've put together a few utilities for histogram log management in HdrLogProcessing. These allow you to split, union and summarize logs, and work with the tags feature. Let's make them into handy aliases:

  • alias hdrsum='java -jar HdrLogProcessing-1.0-SNAPSHOT-jar-with-dependencies.jar SummarizeHistogramLogs'
  • alias hdrspl='java -jar HdrLogProcessing-1.0-SNAPSHOT-jar-with-dependencies.jar SplitHistogramLogs'
  • alias hdruni='java -jar HdrLogProcessing-1.0-SNAPSHOT-jar-with-dependencies.jar UnionHistogramLogs'

The throughput.hdr file we requested uses a recently added HdrHistogram feature which allows for the tagging of different histograms in one log. This makes it easy for applications logging histograms for different measurements to shove them all into a single file rather than many. Since we want to track 2 different operations, each with separate response/service/wait times (so up to 6 histograms), this is quite handy. We can start by using the all-in-one HdrLogProcessing summary utility to add up the histogram data. The default summary mode is percentiles, and it will produce a summary very similar to the one produced above by Cassandra Stress (using the parameters "-if throughput.hdr -ovr 1000000" to summarize in milliseconds):

We can have a microsecond or nanosecond report by tweaking the outputValueUnitRatio (ovr). We are also free to ignore tags and create an all inclusive summary by specifying the ignoreTag (it) parameter. Used together to create a total summary in microseconds we get ("-if throughput.hdr -ovr 1000 -it"):

As a bonus, the summary tool allows us to process multiple files into the same result (logs from multiple benchmark runs, for instance) and to select periods within the logs to include.
The summary tool also supports generating the HGRM format which allows us to produce the following graph (using "-if throughput.hdr -ovr 1000 -st hgrm -of tpt" and the google charts provided in HdrHistogram):

Now imagine we used 3 different machines to stress a single Cassandra node. Because the logs are additive and lossless, there's no issue with using the union tool to aggregate them all into a single log and process it as such. Similarly you can use the splitting tool to split out the operation you are interested in and manipulate it in isolation. Indeed, the sky is the limit.
Now, with a stunning 13K partitions per second and a 50ms 99.99%ile, I might think it a reasonable idea to say my server can safely sustain a 10K pk/s rate and call it a day. I shall test this adventurous little estimate using the throttle mode:
$ ./cassandra-stress user profile=../cqlstress-example.yaml duration=60s ops\(insert=1,simple1=9\) -mode native cql3 protocolVersion=2 -rate throttle=10000/s threads=3 -log interval=10s hdrfile=throttle.hdr

There, according to this we should have no trouble at all with 10K pk/s. I mean sure, it's a bit close to the SLA, but surely not too bad? The throttle mode is in this case just a way to keep your head in the sand a bit longer, but it does record the response time in case you feel like maybe comparing them. Let's compare the service time histogram and the response time histogram for this run. To get the summarized histograms across both operations I need to play with the log processing a bit:

  1. Split the throttle.hdr file:"hdrspl -if throttle.hdr" -> This results in 6 different files <op>-<rt/st/wt>.throttle.hdr.
  2. Summarize the service time histograms:"hdrsum -if .*.-st.throttle.hdr -ovr 1000 -it -st hgrm -of throttle-st" -> we get throttle-st.hgrm
  3. Summarize the response time histograms:"hdrsum -if .*.-rt.throttle.hdr -ovr 1000 -it -st hgrm -of throttle-rt" -> we get throttle-rt.hgrm
  4. Load them into the google charts provided in HdrHistogram.
Take the red pill and you can keep your estimate, take the blue pill and you might keep your job.
This is a fine demonstration of just how far from schedule operations can get without the service time measurement registering an issue. In other words, the red line (service time) is the measurement suffering from coordinated omission; the blue line (response time) is the measurement which does not.
If you are finally convinced that you should be looking at response time instead of service time, you can skip the throttle mode and move straight on to the fixed mode. The difference is only in what view of the world you want to see in your summary.
Finally, you can take the histogram logs and graph the latencies over time using HdrHistogramVisualizer. It currently does not handle tags well, so you'll need to split your logs first, but after that you can generate graphs as funky as this one (this is a longer run with fixed 10K rate, plotting the maximum latency for inserts at the top):
$ ./cassandra-stress user profile=../cqlstress-example.yaml duration=360s ops\(insert=1,simple1=9\) -mode native cql3 protocolVersion=2 -rate fixed=10000/s threads=3 -log interval=10s hdrfile=fixed.hdr
$ hdrspl -if fixed.hdr

This tells an interesting story: it seems as if the server was coping alright with the load to begin with, but hit a bump (GC? compaction?) after 225 seconds, and an even larger bump slightly later. That's lovely to know. I'm not much of a UI wiz and I'm sure some readers out there can make some valuable PRs to this project ;-)

Summary: What's new?

Given that with a recent version of Cassandra Stress you can now test older versions of Cassandra as well (using the protocolVersion parameter as demonstrated above), you can stop using your crummy old version of Cassandra Stress and build trunk from source to get the following benefits:
  • HdrHistogram latency capturing and logging. Set the following options to get your very own histogram log: "-log interval=10s hdrfile=myawesomelog.hdr".
  • Have a look at the HdrLogProcessing project for some handy utilities to help you slice and dice the data. They are pretty simple and you can feel free to build/contribute your own.
  • 2 new load generation modes: throttle and fixed now replace the old limit. Use fixed to get a view on your cluster's response time under attempted constant load.
With HdrHistogram support now available in many languages, you can easily build post processing utilities for analysis in a language of your choice.
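For example, a minimal Java sketch for crunching one of the logs produced above (the file name matches the fixed mode run; the cast assumes plain value histograms, which is what Cassandra Stress writes):

import org.HdrHistogram.Histogram;
import org.HdrHistogram.HistogramLogReader;

// Minimal sketch: union all intervals in a log and print a couple of percentiles.
// Values were recorded in nanoseconds, so we scale to milliseconds on output.
public class SummarizeHdrLog {
    public static void main(String[] args) throws Exception {
        HistogramLogReader reader = new HistogramLogReader("fixed.hdr");
        Histogram total = new Histogram(3);
        Histogram interval;
        while ((interval = (Histogram) reader.nextIntervalHistogram()) != null) {
            total.add(interval);
        }
        System.out.printf("99.9%%ile: %.2f ms, max: %.2f ms%n",
                total.getValueAtPercentile(99.9) / 1_000_000.0,
                total.getMaxValue() / 1_000_000.0);
    }
}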
Have fun storming the castle!

Many many thanks to Jake Luciani of the Apache Cassandra project for reviewing the PR, helping to shape it, and merging it. Jake is an Open Source hero, buy that man a drink on sight!
Both Jake and Chris Batey reviewed this post, if any errors remain it is down to their disgustingly sloppy and unprofessional ways, please let me know and I shall have words with their parents.

Thursday, 23 June 2016

The Pros and Cons of AsyncGetCallTrace Profilers

So, going on from my whingy post on safepoint bias, where does one go to get their profiling kicks? One option would be to use an OpenJDK internal API call, AsyncGetCallTrace, to facilitate non-safepoint collection of stack traces.
AsyncGetCallTrace is NOT an official JVM API. It's not a comforting place to be for profiler writers, and it was originally only implemented on OpenJDK/Oracle JVMs (Zing has recently started supporting AGCT to enable support for Solaris Studio and other profilers; I will write a separate post on Studio). Its original use case was Solaris Studio, and it provides the following API (see forte.cpp, the name is a leftover from the Forte Analyzer days). Here's what the API adds up to:

Simple right? You give it a ucontext and it fills in the trace with call frames (or it gets the hose again).

A Lightweight Honest Profiler

The 'async' in the name refers to the fact that AGCT is safe to call in a signal handler. This is mighty handy in a profiling API as it means you can implement your profiler as follows:
  1. In a JVMTI agent, register a signal handler for signal X.
  2. Set up a timer, triggering signal X at the desired sample frequency. Honest Profiler uses the ITIMER_PROF option, which means we'll get signalled based on CPU time. The signal will be sent to the process and one of the running threads will end up being interrupted and calling into our signal handler. Note that this assumes the OS will distribute the signals fairly between threads, so we will get a fair sample of all running threads.
  3. From the signal handler, call AGCT: Note that the interrupted thread (picked at 'random' from the threads currently executing on CPU) is now running your signal handler. The thread is NOT AT A SAFEPOINT. It may not be a Java thread at all.
  4. Persist the call trace: Note that when running in a signal handler only 'async' code is legal to run. This means, for instance, that any blocking code is forbidden, including malloc and IO.
  5. WIN!
The ucontext ingredient is the very same context handed to you by the signal handler (your signal handler is a callback of the signature handle(int signum, siginfo_t *info, void *context)). From it AGCT will dig up the instruction/frame/stack pointer values at the time of the interrupt and do its best to find out where the hell you landed.
This exact approach was followed by Jeremy Manson, who explained the infrastructure and open sourced a basic profiler (in a proof-of-concept, non-committal sort of way) in a great series of posts on the matter.
The same code was then picked up by Richard Warburton and further improved and stabilized in Honest-Profiler (to which I have made some contributions). Honest Profiler is an effort to make that initial prototype production ready, and complement it with some tooling to make the whole thing usable. The serialization-in-a-signal-handler issue is resolved by using a lock-free MPSC ring buffer of pre-allocated call trace structs (pointing to pre-allocated call frame arrays). A post-processing thread then reads the call traces, collects some extra info (like converting BCIs to line numbers, jmethodIds into class names/file names etc.) and writes to a log file. See the project's wiki for more details.
The log file is parsed offline (i.e. by some other process, on another machine, at some other time. Note that you can process the file while the JVM is still profiling; you need not wait for it to terminate) to give you the hot methods/call tree breakdown. I'm going to use Honest-Profiler in this post, so if you want to try this out at home you're going to have to go and build it to experiment locally (works on OpenJDK/Zulu 6/7/8 + Oracle JVMs + recent Zing builds). I'll do some comparisons for the same experiments with JMC; you'll need an Oracle JVM (1.7u40 and later) to try that out.

What Does AGCT Do?

Going by the API, AGCT is a mapping from instruction/frame/stack pointer to a call trace. The call trace is an array of Java call frames (jmethodId, BCI). To produce this mapping the following process is followed:
  1. Make sure thread is in 'walkable' state, in particular not when:
    • Thread is not a Java thread.
    • GC is active
    • New/uninitialized/just about to die. I.e. threads that are either before or after having Java code running on them are of no interest.
    • During a deopt
  2. Find the current/last Java frame (as in actual frame on the stack, revisit Operating Systems 101 for definitions of stacks and frames):
    • The instruction pointer (commonly referred to as the PC - Program Counter) is used to look up a matching Java method (compiled/interpreter). The current PC is provided by the signal context.
    • If the PC is not in a Java method we need to find the last Java method calling into native code.
    • Failure is an option! We may be in an 'unwalkable' frame for all sorts of reasons... This is quite complex, and if you must know I urge you to get comfy and dive into the maze of relevant code. Trying to qualify the top frame is where most of the complexity is for AGCT.
  3. Once we have a top frame we can fill the call trace. To do this we must convert the real frame and PC into:
    • Compiled call frames: The PC landed on a compiled method, find the BCI (Byte Code Index) and record it and the jMethodId
    • Virtual call frames: The PC landed on an instruction from a compiled inlined method, record the methods/BCIs all the way up to the framing compiled method
    • Interpreted call frames
    • From a compiled/interpreted method we need to walk to the calling frame and repeat until we make it to the root of the Java call trace (or record enough call frames, whichever comes first)
  4. WIN!
Much like a medication's list of potential side effects, the error code list supported by a function can be very telling. AGCT supports the following reasons for not returning a call trace:

While this data is reported from AGCT, it is often missing from the reports based on it. More on that later.

The Good News!

So, looking at the bright side, we can now see between safepoint polls!!! How awesome is that? Let's see exactly how awesome by running the benchmarks which we could not measure correctly with the safepoint biased profiler in the previous post.
[Note: Honest-Profiler reports (t X,s Y) which reflect the total % of stack trace samples containing this method+line vs. the self % of samples in which this method+line is the leaf. The output is sorted by self.]:

Sweet! Honest Profiler is naming the right methods as the hot methods, and nailing the right lines where the work is happening.
Let's try the copy benchmark, this time comparing the Honest-Profiler result with the JMC result:
Note the difference that switching on the "-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints" flags makes to JMC's reporting. Honest-Profiler now (a recent change, hot off the press) turns the flag on without requiring user intervention, but JMC does not currently do that (this is understandable when attaching after the process has started, but should perhaps be the default if the process is started with recording on).
Further experimentation and validation exercises for the young and enthusiastic:
  • Compare the Honest-Profiler results with JFR/JMC results on larger applications. What's different? Why do you think it's different?
  • Run with -XX:+PrintGCApplicationStoppedTime to verify that no extra safepoints are caused by the profiling (test with profiler on/off).
  • The extra-studious of y'all can profile the other cases discussed in the previous post to see they similarly get resolved.
While this is a step forward, we've still got some issues here...

Collection errors: Runtime Stubs

As we can see from the list of errors, any number of events can result in a failed sample. A particularly nasty type of failed sample is when you hit a runtime generated routine which AGCT fails to climb out of. Let's see what happens when we don't roll our own array copy:
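The benchmark is along these lines (a sketch rather than the exact code):

import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;

// Sketch: copy an array via System.arraycopy, which the JIT replaces with an
// intrinsic calling into a runtime generated stub.
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class SystemArrayCopyBench {
    @Param("65536")
    int size;
    byte[] src;
    byte[] dst;

    @Setup
    public void setup() {
        src = new byte[size];
        dst = new byte[size];
    }

    @Benchmark
    public byte[] systemArrayCopy() {
        System.arraycopy(src, 0, dst, 0, size);
        return dst;
    }
}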

Now, this great unknown marker is a recent addition to Honest-Profiler, which increases its honesty about failed samples. It indicates that for 62.9% of all samples, AGCT could not figure out what was happening and returned ticks_unknown_java. Given that there's very little code under test here, we can deduce the missing samples all fall within System.arraycopy (you can pick a larger array size to further prove the point, or use a native profiler for comparison).
Profiling the same benchmarks under JMC will not report failed samples, and the profiler will divide the remaining samples as if the failed samples never happened. Here's the JMC profile for systemArrayCopy:

Number of samples(over 60 seconds) : 2617
Method::Line                              Samples   %
systemArrayCopy_avgt_jmhStub(...) :: 165    1,729   66.043
systemArrayCopy()                 ::  41      331   12.643
systemArrayCopy()                 ::  42      208    7.945
systemArrayCopy_avgt_jmhStub(...) :: 166       93    3.552
systemArrayCopy_avgt_jmhStub(...) :: 163       88    3.361

JMC is reporting a low number of samples (the total is sort of available in the tree view as the number of samples in the root), but without knowing what the expected number of samples should be this is very easy to miss. This is particularly true for larger and noisier sample sets from real applications collected over longer periods of time.
Is this phenomenon isolated to System.arraycopy? Not at all; here are CRC32 and Adler32 as a further comparison:

What do CRC32 and System.arraycopy have in common? They are both JVM intrinsics, replacing a method call (Java/native doesn't really matter, though in this case both are native) with a combination of inlined code and a call to a JVM runtime generated method. This method call is not guarded the same way a normal call into native methods is, and thus the AGCT stack walking is broken.
Why is this important? The reason these methods are worth making into intrinsics is because they are sufficiently important bottlenecks in common applications. Intrinsics are like tombstones for past performance issues in this sense, and while the result is faster than before, the cost is unlikely to be completely gone.
Why did I pick CRC32? Because I recently spent some time profiling Cassandra. Version 2 of Cassandra uses Adler32 for checksums, while version 3 uses CRC32. As we can see from the results above, that's potentially a good choice, but if you were to profile Cassandra 3 it would look better than it actually is, because all the checksum samples are unprofilable. Profiling with a native profiler will confirm that the checksum cost is still a prominent element of the profile (for the particular setup/benchmark I was running).
AGCT profilers are blind to runtime stubs (some or all, this may get fixed in future releases...). Failed samples are an indication of such a blind spot.

Exercise to the reader:

  • Construct a benchmark with heavy GC activity and profile it. The CPU spent on GC will be absent from your JMC profile, and should show as ticks_GC_active in your Honest-Profiler profile.
  • Construct a benchmark with heavy compiler activity and profile it. As above, look for compilation CPU. There's no compilation error code, but you should see a lot of ticks_unknown_not_Java, which indicates a non-Java thread has been interrupted (this is a conflated error code; we'll be fixing it soon).
  • Extra points! Construct a benchmark which is spending significant time deoptimising and look for the ticks_deopt error in your Honest-Profiler profile.

Blind Spot: Sleeping code

Since sleeping code doesn't actually consume CPU it will not appear in the profile. This can be confusing if you are used to the JVisualVM style of reporting where all stacks are reported, including WAITING (or sometimes RUNNING Unsafe.park). Profiling sleeping or blocked code is not something these profilers cover. I mention this not because they promise to do this and somehow fail to deliver, but because it's a common pitfall. For Honest-Profiler, comparing expected and actual samples should help highlight mostly off-CPU applications as such (the delta is signals which were not delivered to the process because it was not running).

Error Margin: Skidding + inlining

I'll dive deeper into this one another time, since other profilers share this issue. In essence the problem is that instruction profiling is not accurate, and we can often encounter a skid effect when sampling the PC (program counter). This in effect means the instruction reported can be a number of instructions after the instruction where the time is actually spent. Since AsyncGetCallTrace relies on PC sampling to resolve the method and code line, we get the same inaccuracy reflected through the layers of mapping (PC -> BCI -> LOC). But where on the assembly level we deal with a single blob of instructions in which proximity is meaningful, in converting back to Java we may have slipped from one inlined method to the next, and the culprit is no longer in a nearby line of code.
To be continued...
But seriously guys, we should be using Fortran for this

Summary: What is it good for?

AsyncGetCallTrace is a step up from GetStackTraces as it operates at lower overheads and does not suffer from safepoint bias. It does require some mental model adjustments:
  1. -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints: if this is not on you are still hobbled by the resolution of the debug data. This is different from safepoint bias, since you actually can sample anywhere, but the translation from PC to BCI kills you. I've not seen these flags result in measurable overhead, so I'm not convinced the default value is right here, but for now this is how it is. Honest-Profiler takes care of this for you; with JMC you'll need to add it to your command line.
  2. Each sample is for a single, on-CPU thread: this is very different from the GetStackTraces approach which sampled all threads. It means you are getting fewer traces per sample, and are completely blind to sleeping/starved threads. Because the overhead is that much lower you can compensate by sampling more frequently, or over longer periods of time. This is a good thing; sampling all the threads at every sample is a very problematic proposition given that the number of threads can be very large.
AsyncGetCallTrace is great for profiling most 'normal' Java code, where hotspots are in your Java code, or rather in the assembly code they result in. This seems to hold in the face of most optimizations to reasonable accuracy (but may on occasion be off by quite a bit...).

AsyncGetCallTrace is of limited use when:
  1. Large numbers of samples fail: this can mean the application is spending its time in GC/deopts/runtime code. Watch for failures. I think currently Honest-Profiler offers better visibility on this, but I'm sure the good folks of JMC can take a hint.
  2. Performance issue is hard to glean from the Java code. E.g. see the issue discussed in a previous post using JMH perfasm (false sharing on the class id in an object header making the conditional inlining of an interface call very expensive). 
  3. Due to instruction skid/compilation/available debug information the wrong Java line is blamed. This is potentially very confusing in the presence of inlining and code motion.
Using Honest-Profiler you can now profile Open/Oracle JDK6/7/8 applications on Linux and OS X. You can also use it to profile Zing applications on recent Zing builds (version and later, all JDKs). Honest-Profiler is lovely, but I would caution readers that it is not in wide use, may contain bugs, and should be used with care. It's a useful tool, but I'm not sure I'd unleash it on my production systems just yet ;-).
JMC/JFR is available on Oracle JVMs only from JDK7u40, but works on Linux, OS X, Windows and Solaris (JFR only). JMC/JFR is free for development purposes, but requires a licence to use in production. Note that JFR collects a wealth of performance data which is beyond the scope of this post; I wholeheartedly recommend you give it a go.

A big thank you to all the kind reviewers: JP Bempel, Doug Lawrie, Marcus Hirt and Richard Warburton, any remaining errors will be deducted from their bonuses directly.

Monday, 21 March 2016

GC 'Nepotism' And Linked Queues

I've just run into this issue this week, and it's very cute, so this here is a summary. Akka has their own MPSC linked queue implementation, and this week it was suggested that they swap to using JCTools. The context in which the suggestion was made was a recently closed bug with the mystery title: AbstractNodeQueue suffers from nepotism.
An Akka user claimed to not be suffering from the bug because they were using JCTools, but as it turned out the issue was indeed present, so I had to fix it etc... But what the hell is GC nepotism? You won't find it in your GC textbooks, but it's a catchy term for a very real problem.

Your Dead Old Relatives Can Promote You!

In OpenJDK's generational GCs there are distinct generations, Old and New. The 2 generations get collected separately, but are not independent of each other. In particular, references from the NewGen to the OldGen will keep OldGen objects alive through an OldGen collection cycle, and similarly references from the OldGen to the NewGen will keep objects alive through a NewGen collection. This is all quite straightforward, but has a surprising implication:
Dead OldGen objects will keep the dead NewGen objects they reference alive until an OldGen collection.
The reverse is also true, but since the whole notion of generational GC is based on the assumption that the new gen is collected far more often than the old gen, the above case is the real issue. How does this affect you? Is this even a real problem?
Turns out it's real enough to have convinced Doug Lea to fix a particular manifestation of it, but also subtle enough to have slipped past him (and me, and the Akka guys, and Martin Thompson, and others no doubt) in the first place.

Linked Queues: Pathological Nepotism

Consider the following LinkedQueue implementation:
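Something like this sketch (my own simplified rendering; the real implementations use atomic field updaters and are more careful, but the structure is the same):

import java.util.concurrent.atomic.AtomicReference;

// Simplified sketch of a Vyukov-style linked queue: producers swap new nodes in
// at head, the consumer advances tail along the next links.
class LinkedQueue<E> {
    static class Node<E> {
        E value;
        volatile Node<E> next;
        Node(E value) { this.value = value; }
    }

    final AtomicReference<Node<E>> head;
    Node<E> tail;

    LinkedQueue() {
        Node<E> stub = new Node<>(null);
        head = new AtomicReference<>(stub);
        tail = stub;
    }

    public void offer(E e) {
        Node<E> node = new Node<>(e);
        Node<E> prev = head.getAndSet(node); // multi-producer safe swap
        prev.next = node;                    // link the chain from tail towards head
    }

    public E poll() {
        Node<E> next = tail.next;
        if (next == null) return null;       // empty (or producer mid-link)
        E value = next.value;
        next.value = null;                   // 'next' becomes the new retained node
        tail = next;                         // NOTE: the old tail still references next...
        return value;
    }
}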
The following invariants are maintained:
  • head == tail <=> queue is empty
  • head.next == null
  • tail.value == null
  • In summary: queue is empty <=> (head == tail && head.next == null && tail.value == null)
The nodes are created and linked together on offer, and dereferenced as we poll. The poll method MUST clear the value because the queue retains the last node. If we were to let the last node keep the value we would have effectively leaked that value reference.
Ignoring the values, imagine the queue is part of the live set and we have enqueued a few elements, the arrow denotes what next references:
head -> N3 -> null
tail -> N0 -> N1 -> N2 -> N3

If we now dequeue the elements we'll get back to empty; brackets denote live objects referenced from dead ones:
head -> N3 -> null
tail -> N3 -> null
DEAD:  N0 -> N1 -> N2 (-> N3)

Dead objects are traditionally considered the GC's problem, and cleaning up about-to-be-collected objects is considered an anti-pattern. After all, what is the point of cleaning up garbage? You are just duplicating work. Indeed, in a single generation collector this is a non-issue.
But consider for a moment the possibility that one of the above nodes was promoted into the OldGen. This is something which may not happen in short lived tests, or even in 'lucky' long lived tests, but is obviously quite possible. We would have:
head -> N3 -> null
tail -> N3 -> null
Zombie NEW:  N1 -> N2 (-> N3)
DEAD OLD:  N0 (-> N1 -> N2 (-> N3))

The young are alive, as far as the old are concerned. The old are dead, but their promotion gives life to their 'children' until they get collected. To demonstrate this issue I've written the following program: 
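Something along these lines (a sketch, not the original program, reusing the queue sketched above; run it with a small young gen and -verbose:gc to watch the carnage):

// Sketch: phase 1 parks some live nodes in the queue long enough to get tenured;
// once drained they are dead, but sit in the old gen pointing at every node that
// follows, so in phase 2 the young collections promote everything.
public class NepotismDemo {
    static final Object V = new Object();

    public static void main(String[] args) {
        LinkedQueue<Object> q = new LinkedQueue<>();
        // Phase 1: enqueue a batch and generate allocation pressure until the
        // live nodes are promoted by repeated young GCs
        for (int i = 0; i < 1000; i++) {
            q.offer(V);
        }
        byte[][] garbage = new byte[256][];
        for (int i = 0; i < 100_000_000; i++) {
            garbage[i & 255] = new byte[64];
        }
        while (q.poll() != null) { } // the promoted nodes are now dead... in the old gen
        // Phase 2: rapid churn; each dead old gen node keeps its young successor
        // alive through young GCs, until an old/full GC finally breaks the chain
        while (true) {
            q.offer(V);
            q.poll();
        }
    }
}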

If you look at the GC logs, or plug in VisualVM with the Visual GC plugin, you will notice the young GC is entirely ineffective. Everything gets promoted to old, until an old GC clears the lot. After the old GC the problem disappears (in the test program; it is likely to repeat in real programs). This is 'pathological' because ANY promoted node will result in the promotion of ALL following nodes until a GC resolves the issue.

At this point you might be shaking your head sadly and saying: "Ye thick old fart, surely this is a silly issue which NEVER happens in the real world". I've had my share of heads being shaken at me and am not offended, but consider that this issue is real enough for other people:

An Exercise To the Reader

My plan was to walk through the way this issue has been fixed in JCTools vs Akka vs Agrona, each solution reflecting different considerations. Sadly I'm quite short on time for a tidy summary, so I'll give my observations, which you can choose to read up on and argue with.
All three implementations follow the algorithm outlined by D. Vyukov. Note that Agrona is ignoring the Queue interface requirement for poll (return null iff the queue is empty).

  • In Akka the next reference is nulled on poll. This solves the issue, but at the cost of further weakening the accuracy of the size() method.
  • In JCTools the next reference is set to point to the discarded node itself. This helps in telling the difference between a removed node and the head, a piece of information the size method uses to skip gaps in the counted chain when racing with the consumer thread. The result is a more accurate size estimate. The change does come at a cost to the size method however, as we are no longer able to rely on the chain remaining whole throughout the size operation. (Both flavours of poll are sketched right after this list.)
  • In Agrona, the next reference is nulled as in Akka. A further effort is made to reuse a special empty node to avoid repeated promotion of transient nodes in empty queues. The effort is arguably worthwhile on systems with large numbers of active but mostly empty queues.
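Sticking with the sketch queue from earlier, the two flavours of fix boil down to something like this (my own simplified rendering, not the actual JCTools/Akka/Agrona code):

// Variants of poll() for the sketch LinkedQueue above. Both break the link out
// of the consumed node, so a dead node promoted to the old gen cannot keep its
// successors alive.

// Akka/Agrona style: null the consumed node's next reference
public E pollNullingNext() {
    Node<E> curr = tail;
    Node<E> next = curr.next;
    if (next == null) return null;
    E value = next.value;
    next.value = null;   // 'next' becomes the new retained node, it must not keep the value
    curr.next = null;    // the dead node no longer drags the live chain along
    tail = next;
    return value;
}

// JCTools style: make the consumed node point to itself, which both breaks the
// chain and marks the node as removed (information the size() method can use)
public E pollSelfLinking() {
    Node<E> curr = tail;
    Node<E> next = curr.next;
    if (next == null) return null;
    E value = next.value;
    next.value = null;
    curr.next = curr;    // self-link: removed marker, no reference to live nodes
    tail = next;
    return value;
}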

Open questions/discussion:

  • How common is this issue for 'classic' data structures? It seems reasonable that a host of queue/stack/tree data structures are similarly vulnerable. A scan of JDK sources suggests this is a known pitfall to the JDK developers, but I would bet many third party framework writers didn't get the memo.
  • How common is this issue for idiomatic 'user' Java code? Given the tendency towards composition rather than inheritance and the drive towards smaller objects, we may argue that user object graphs often approximate trees. If we have short-lived 'trees', a parent node can certainly drag a useless chunk of state to the old gen. If a reference bridges 2 trees we may have something approximating the pathological behaviour above. I'm not aware of any studies trying to determine the impact of nepotism on user code; part of the problem would be the definition of 'typical' user code.
  • Where does the term 'nepotism' come from? I made an attempt at tracing the origins, but got nowhere. The problem seems to be documented at least 8 years ago, but the term nepotism is not used in most writing. [UPDATE: it seems the term can be tracked down to a GC paper published in 1991: An Adaptive Tenuring Policy for Generation Scavengers]

Wednesday, 24 February 2016

Why (Most) Sampling Java Profilers Are Fucking Terrible

This post builds on the basis of a previous post on safepoints. If you've not read it you might feel lost and confused. If you have read it, and still feel lost and confused, and you are certain this feeling is related to the matter at hand (as opposed to an existential crisis), please ask away.
So, now that we've established what safepoints are, and that:
  1. Safepoint polls are dispersed at fairly arbitrary points (depending on execution mode, mostly at uncounted loop back edges or method return/entry).
  2. Bringing the JVM to a global safepoint is high cost.
We have all the information we need to conclude that profiling by sampling at a safepoint is perhaps a bit shite. This will not come as a surprise to some, but this issue is present in the most commonly used profilers. According to this survey by RebelLabs this is the breakdown:

VisualVM, NB Profiler (same thing), YourKit and JProfiler all provide a sampling CPU profiler which samples at a safepoint. Seeing how this is a rather common issue, let's dig into it.

How Sampling Execution Profilers Work (in theory)

Sampling profilers are supposed to approximate the 'time spent' distribution in our application by collecting samples of where our application is at different points in time. The data collected at each sample could be:
  • current instruction
  • current line of code
  • current method
  • current stack trace
The data can be collected for a single thread or all threads at each sample. What do we need to hold for sampling to work?
"However, for sampling to produce results that are comparable to a full (unsampled) profile, the following two conditions must hold. First, we must have a large number of samples to get statistically significant results. For example, if a profiler collects only a single sample in the entire program run, the profiler will assign 100% of the program execution time to the code in which it took its sample and 0% to everything else. [...]
Second, the profiler should sample all points in a program run with equal probability. If a profiler does not do so, it will end up with bias in its profile. For example, let’s suppose our profiler can only sample methods that contain calls. This profiler will attribute no execution time to methods that do not contain calls even though they may account for much of the program’s execution time." - from Evaluating the Accuracy of Java Profilers, we'll get back to this article in a bit
That sounds easy enough, right?
Once we have lots of samples we can construct a list of hot methods, or even code lines in those methods (if the samples report it), we can look at the samples distributed on the call tree (if call traces are collected) and have an awesome time!

How Do Generic Commercial Java Sampling Execution Profilers Work?

Well, I can sort of reverse engineer here from different solutions, or read through open source code bases, but instead I'll offer unsupported conjecture and you are free to call me out on it if you know better. Generic profilers rely on the JVMTI spec, which all JVMs must meet:
  • JVMTI offers only safepoint sampling stack trace collection options (GetStackTrace for the calling thread doesn't require a safepoint, but that is not very useful to a profiler. On Zing, GetStackTrace to another thread will bring only that thread to a safepoint). It follows that vendors who want their tools to work on ALL JVMs are limited to safepoint sampling.
  • You hit a global safepoint whether you are sampling a single thread or all threads (at least on OpenJDK; Zing is slightly different, but as a profiler vendor OpenJDK is your assumption). All the profilers I looked into go for sampling all threads. AFAIK they also do not limit the depth of the stack collected. This amounts to the following JVMTI call: JvmtiEnv::GetAllStackTraces(0, &stack_info, &thread_count)
  • So this adds up to: set up a timer thread which triggers at 'sampling_interval' and gathers all stack traces.
This is bad for several reasons, some of which are avoidable:
  1. Sampling profilers need samples, so it is common to set the sampling frequency quite high (usually 10 times a second, or every 100ms). It's instructive to set the -XX:+PrintGCApplicationStoppedTime flag and see what sort of pause time this introduces. It's not unusual to see a few milliseconds of pause, but YMMV (depending on number of threads, stack depth, TTSP etc.). A 5ms pause every 100ms will mean a 5% overhead (the actual damage is likely worse than that) introduced by your profiler. You can usually control the damage here by setting the interval longer, but this also means you will need a longer profiling period to get a meaningful sample count.
  2. Gathering full stack traces from all the threads means your safepoint operation cost is open ended. The more threads your application has (think application server, SEDA architecture, lots of thread pools...), and the deeper your stack traces (think Spring and Co.), the longer your application will wait for a single thread to go round taking names and filling forms. This was clearly demonstrated in the previous post. AFAIK, current profilers do nothing to help you here. If you are building your own profiler it would seem sensible to set a limit on either quantity so that you can box your overheads. The JVMTI functionality allows you to query the list of current threads; you could sample all of them if there are fewer than 100, and otherwise pick a random subset of 100 to sample. It would make sense to perhaps bias towards sampling threads that are actually doing something, as opposed to threads which spend all their time blocked.
  3. As if all that was not bad enough, sampling at a safepoint is a tad meaningless.
Points 1 and 2 are about profiling overheads, which is basically about cost. In my previous post on safepoints I looked at these costs, so there's no point repeating the exercise.  Cost may be acceptable for good profiling information, but as we'll see the information is not that great.
Point 3 bears explaining, so off we go in the search for meaning.

Safepoint Sampling: Theory

So what does sampling at a safepoint mean? It means only the safepoint polls in the running code are visible. Given that hot code is likely compiled by C1/C2 (the client/server compilers), we have reduced our sampling opportunities to method exits and uncounted loop back edges. This leads to the phenomenon called safepoint bias, whereby the sampling profiler samples are biased towards the next available safepoint poll location (this fails the second criterion set out above: "the profiler should sample all points in a program run with equal probability").
This may not sound so bad at first, so lets work through a simple example and see which line gets the blame.
NOTE: In all of the following examples I will be using JMH as the test harness and make use of the 'CompilerControl' annotation to prevent inlining. This will let me control the compilation unit limits, and may seem cruel and unusual, or at least unfair, of me. Inlining decisions in the 'wild' are governed by many factors and it is safe (in my opinion) to consider them arbitrary (in the hands of several compilers/JVM vendors/command line arguments etc.). Inlining may well be the "mother of all optimizations", but it is a fickle and wily mother at that.
Let's look at something simple:
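Something along these lines (a sketch of the shape of the benchmark, not the exact code; the real thing lives in the safepoint.profiling.SafepointProfiling class named in the output further down):

import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;

// Sketch: all the work is in a counted loop with no safepoint poll in it, so the
// only poll associated with each method is at its exit.
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class SafepointProfiling {
    @Param("1000")
    int size;
    byte[] buffer;
    byte[] dst;

    @Setup
    public void setup() {
        buffer = new byte[size];
        dst = new byte[size];
    }

    @Benchmark
    public void meSoHotInline() {
        for (int i = 0; i < buffer.length; i++) { // counted loop, no poll
            dst[i] = buffer[i];
        }
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public void meSoHotNoInline() {
        for (int i = 0; i < buffer.length; i++) { // same work, but the method is never inlined
            dst[i] = buffer[i];
        }
    }
}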

This is an easy example to think about. We can control the amount of work in the method by changing the size of the array. We know the counted loop has no safepoint poll in it (verified by looking at the assembly output), so in theory the methods above will have a safepoint poll at method exit. The thing is, if we let the method above get inlined, the end-of-method safepoint poll will disappear, and the next poll will be in the measurement loop:

So, it would seem reasonable to expect the method to get blamed if it is not inlined, but if it does get inlined we can expect the measurement method to get blamed. Right? Very reasonable, but a bit off.

Safepoint Sampling: Reality

Safepoint bias is discussed in this nice paper from 2010: Evaluating the Accuracy of Java Profilers, where the writers recognize that different Java profilers identify different hotspots in the same benchmarks and go digging for reasons. What they don't do is set up some benchmarks where the hotspot is known and use those to understand what it is that safepoint biased profilers see. They state:
"If we knew the “correct” profile for a program run, we could evaluate the profiler with respect to this correct profile. Unfortunately, there is no “correct” profile most of the time and thus we cannot definitively determine if a profiler is producing correct results."
So, if we construct a known workload... what do these profilers see?
We'll study that with the JMH safepoint biased profiler "-prof stack". It is much like JVisualVM in the profiles it presents for the same code, and a damn sight more convenient for this study.
NOTE: In the following sections I'm using the term sub-method to describe a method which is being called into from another method. E.g. if method A calls method B, B is a sub method of A. Maybe a better terminology exists, but that's what I mean here.
If we run the samples above we get 2 different hot lines of code (run with '-prof stack:detailLine=true'):
# Benchmark: safepoint.profiling.SafepointProfiling.meSoHotInline
....[Thread state: RUNNABLE]...
 99.6%  99.8% meSoHotInline_avgt_jmhStub:165

# Benchmark: safepoint.profiling.SafepointProfiling.meSoHotNoInline
....[Thread state: RUNNABLE]...
 99.4%  99.6% meSoHotNoInline_avgt_jmhStub:163

Neither one is in the actually hot method. It seems that the method exit safepoint is not deemed indicative of its own method, but rather of the line of code from which it was called. So forcing the method under measurement to not inline implicated the calling line of code in the measurement loop, while leaving it to inline meant the back edge of the loop got the blame. It also appears that an uncounted loop safepoint poll is deemed indicative of its own method.
We may deduce (though we won't necessarily be correct) that when looking at this kind of profile without line of code data, a hot method is indicative of:
  1. Some non-inlined sub method is hot*
  2. Some code (own method? inlined sub method? non-inlined sub method?) in an uncounted loop is hot*
Having line of code data can help disambiguate the above cases, but it's not really very useful as line of code data. A hot line of code would be indicative of:
  1. Line has a method call: A method called from this line is hot (or maybe it's inlined sub-methods are)*
  2. Line is a loop back edge: Some code (include inlined submethods) in this loop is hot*
  *Does that seem useful? Don't get your hopes up.
Because we usually have no idea which methods got inlined, this can be a bit confusing (you can use -XX:+PrintInlining if you want to find out, but be aware that inlining decisions can change from run to run).

Mind The Gap

If the above rules held true you could use a safepoint biased profile by examining the code under the blamed node in the execution tree. In other words, it would mean a hot method indicates the hot code lies somewhere in that code or in the method it calls. This would be good to know, and the profile could serve as a starting point for some productive digging. But sadly, these rules do not always hold true. They are missing the fact that the hot code can be anywhere between the indicated safepoint poll and the previous one. Consider the following example:

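This example was an embedded gist as well; here is a sketch of the shape being described, assuming it lives in the same benchmark class as the sketch above (with org.openjdk.jmh.infra.Blackhole imported). setResult and the non-inlined method it calls are my reconstruction.
  @Benchmark
  public void blameSetResult(Blackhole bh) {
    int sum = 0;
    for (int i = 0; i < data.length; i++) { // all the time is spent here, but a counted loop has no poll
      sum += data[i];
    }
    setResult(bh, sum); // the profile ends up blaming setResult
  }

  private void setResult(Blackhole bh, int sum) {
    consumeResult(bh, sum); // not inlined: its return poll is the first poll after the hot loop
  }

  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  private void consumeResult(Blackhole bh, int sum) {
    bh.consume(sum);
  }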
Clearly, the time is spent in the loop before calling setResult, but the profile blames setResult. There's nothing wrong with setResult except that a method it calls into is not inlined, providing our profiler with a blame opportunity. This demonstrates the randomness with which the safepoint poll opportunities present themselves to user code, and shows that the hot code may be anywhere between the current safepoint poll and the previous one. This means that a hot method/line-of-code in a safepoint biased profile is potentially misleading without some idea of where the previous safepoint poll is. Consider the following example:

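Again the original was a gist; below is a condensed reconstruction of the shape described next. The setResultN naming follows the text, the method bodies are my assumptions.
  @Benchmark
  public void blameDeepMethod(Blackhole bh) {
    int sum = 0;
    for (int i = 0; i < data.length; i++) { // the real culprit: a counted loop with no poll
      sum += data[i];
    }
    setResult1(bh, sum);
  }

  private void setResult1(Blackhole bh, int v) { setResult2(bh, v); }
  private void setResult2(Blackhole bh, int v) { setResult3(bh, v); }
  private void setResult3(Blackhole bh, int v) { setResult4(bh, v); }
  private void setResult4(Blackhole bh, int v) { setResult5(bh, v); }
  private void setResult5(Blackhole bh, int v) { setResult6(bh, v); }

  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  private void setResult6(Blackhole bh, int v) { setResult7(bh, v); } // not inlined, yet never gets the blame

  private void setResult7(Blackhole bh, int v) { setResult8(bh, v); }
  private void setResult8(Blackhole bh, int v) { setResult9(bh, v); }

  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  private void setResult9(Blackhole bh, int v) { bh.consume(v); } // cheap and deep: its return poll takes the blame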
The profiler implicates the caller of a cheap method 9 levels down the stack, but the real culprit is the loop at the topmost method. Note that inlining will prevent a method from showing, but not-inlined frames only break the gap between safepoints on return (on OpenJDK anyhow; this is vendor and whim dependent, e.g. Zing puts the method level safepoint on entry, and I'm not sure where J9 stands on this issue. This is not to say that one way is somehow better than the other, just that the location is arbitrary). This is why setResult6, which is not inlined and higher up the stack, doesn't show up.

Summary: What Is It Good For?

As demonstrated above, a safepoint sampling profiler can have a wildly inaccurate idea of where the hot code in your application is. This renders derived observations on "Running" threads pretty suspect, but they are at least correct observations on which threads are running. This doesn't mean they are entirely useless and sometimes all we need is a hint in the right direction for some good analysis to happen, but there's a potential for huge waste of time here. While samples of code running in the interpreter do not suffer safepoint bias, this is not very useful as hot code is quickly compiled. If your hot code is running in the interpreter you have bigger fish to fry than safepoint bias...
The stack traces for blocked threads are accurate, and so the "Waiting" profile is useful for discovering where your code blocks. If blocking methods are at the root of your performance issue then this will be a handy observation.
There are better options out there! I'll get into some of them in following posts:
- Java Mission Control
- Solaris Studio
- Honest-Profiler
- Perf + perf-map-agent (or perfasm if your workload is wrapped in a JMH benchmark)
No tool is perfect, but all of the above are a damn sight better at identifying where CPU time is being spent.
Big thanks to the many reviewers; if you find any issues please comment and I will punish them severely.

UPDATE: Read follow up post of Honest-Profiler/Java Flight Recorder type profilers in this next post - The Pros And Cons of AsyncGetCallTrace Profilers

Monday, 1 February 2016

Wait For It: Counted/Uncounted loops, Safepoints and OSR Compilation

In this previous post about Safepoints I claimed that this here piece of code:

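The snippet was an embedded gist (the original Java file is linked below); here is a reconstruction along the lines described in the text. The class name, the nested loop shape and the exact numbers are my best guess, not the original source.
  public class WhenWillItExit {
    public static void main(String[] args) throws InterruptedException {
      Thread t = new Thread(() -> {
        long l = 0;
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
          for (int j = 0; j < Integer.MAX_VALUE; j++) {
            if ((j & 1) == 1) l++;
          }
        }
        System.out.println("How Odd:" + l);
      });
      t.setDaemon(true);
      t.start();
      Thread.sleep(5000); // a daemon thread should not stop the JVM exiting here... and yet
    }
  }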
Will get your JVM to hang and not exit after 5 seconds, as one might expect from reading the code. The reason this happens, I claimed, is that the compiler considers for loops limited by an int to be counted, and will therefore not insert a safepoint poll into the generated assembly code. This was the theory, and I tested it, and it definitely hung on my machine ('It Works On My Machine'™).
To my great surprise, some people not only read the posts, they also ran the code (it's like they don't trust me ;-) ). And 2 of those diligent readers got back to me in the comments saying that actually the JVM exits just fine. So I ran it again, and it hung, again. Then I ran it with different versions of the JVM, and it hung as I expected. Then I thought, maybe they are running the code I put on GitHub, so I tried that. That also hung when I ran it from Eclipse, but did not when I built it with Maven and ran it from the command line. EUREKA!
As it happens, Eclipse has its own Java compiler (a fact well known to those who know it well, no doubt) and the bytecode it generated (the .class file) hung every time, but Maven uses javac and the .class file generated from that did not hang. When testing initially I was running my class inside Eclipse, or using the Eclipse .class file to run from the command line. Both .class files are valid, but different (same-same but different). We can use javap to print out class bytecode and compare ("javap -c -v -p <class file>"). For your pleasure you can find the original Java file, the 2 class files and their respective bytecode printouts on GitHub and come to your own conclusions before you continue reading.
So... is my understanding (and the Eclipse compiler's) somehow out of date? Can this problem of telling counted loops from uncounted loops be solved by the bytecode compiler?

How is the bytecode different?

Well, mostly not that different:
  1. An easy to spot difference, which we can tell is semantically the same, is the bytecode generated for "System.out.println("How Odd:" + l);". You'll notice that Eclipse is using the constructor "java/lang/StringBuilder."<init>":(Ljava/lang/String;)V" while javac is opting for "java/lang/StringBuilder."<init>":()V" followed by an append. Semantically the 2 choices are the same. We create a StringBuilder and append the String "How Odd:" and the variable l. The reality is that there are subtle differences; the Eclipse version will allocate a larger char[], for instance. It may also be that the Eclipse pattern will miss out on some String concatenation optimizations because the OpenJDK implementers are likely to focus on the patterns generated by the official javac.
  2. The loop interpretation is different. Eclipse is checking the loop conditions at the end of the loop whereas javac is building a loop with the checks upfront and the loop backedge at the end. That seems like the important difference.
Let's zoom in on the relevant bytecode for the loop:

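The javap output was a gist too; schematically (labels instead of bytecode offsets, loop body and constant pool details elided, so this is a sketch rather than real javap output) the two shapes look like this:
  javac: test up front, unconditional backward goto at the end
    iconst_0
    istore      i
  COND:
    iload       i
    ldc         2147483647      // Integer.MAX_VALUE
    if_icmpge   END             // forward branch out of the loop
    ...                         // loop body
    iinc        i, 1
    goto        COND            // the back edge
  END:

  Eclipse: jump to a test at the bottom, the test itself is the back edge
    iconst_0
    istore      i
    goto        COND
  BODY:
    ...                         // loop body
    iinc        i, 1
  COND:
    iload       i
    ldc         2147483647
    if_icmplt   BODY            // conditional backward branch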
If we were to reverse translate the loop bytecode to ugly Java (which supported goto) we'd get:

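Something along these lines (pseudo-code, since Java has no goto; a reconstruction of the two shapes rather than the original gist, with the loop variable name simplified):
  // javac version: check up front, unconditional jump back
  int i = 0;
  loopStart:
  if (i >= Integer.MAX_VALUE) goto loopEnd;
  if ((i & 1) == 1) l++;
  i++;
  goto loopStart;
  loopEnd:

  // Eclipse version: jump to the check at the bottom, the check is the back edge
  int i = 0;
  goto check;
  loopBody:
  if ((i & 1) == 1) l++;
  i++;
  check:
  if (i < Integer.MAX_VALUE) goto loopBody;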
Note that there's no special Java bytecode for loop control flows. So why so different at runtime? If we dump out the assembly for the run method we will see that the Eclipse version is missing the safepoint in the loop (the method has a {poll_return} but there's no {poll}), and the javac version has that safepoint there. Is it the bytecode difference that makes this happen?

Variations on bytecode

I tinkered with the class a bit and refactored the loop out of the lambda. This results in similar bytecode for the loop and the same behaviour. I then made the limit of the loop a parameter of the method; this time it behaved like the javac version, but had the same funky loop structure:

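Roughly like the sketch below. The class and method names follow the PrintCompilation output further down; the body is my reconstruction, the real thing is in the GitHub repo mentioned above.
  public class WhenWillItExitInt2 {
    public static void main(String[] args) throws InterruptedException {
      Thread t = new Thread(() -> System.out.println("How Odd:" + countOdds(Integer.MAX_VALUE)));
      t.setDaemon(true);
      t.start();
      Thread.sleep(5000);
    }

    static long countOdds(int limit) { // the loop limit is now a parameter, not a constant
      long l = 0;
      for (int i = 0; i < limit; i++) {
        for (int j = 0; j < limit; j++) {
          if ((j & 1) == 1) l++;
        }
      }
      return l;
    }
  }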
WTF? If this is purely a bytecode subtlety then it seems pretty weird. What sort of flaky compiler would swing like this?

The compiler did it?

But... which compiler? So far we have looked at the bytecode compiler's output, but this is only the first of many compilers to process your code on its way to being executed. Traditionally when discussing Java code compilation we look at the following stages:
  1. Javac: From code to bytecode
  2. Interpreter: Step through the bytecode
  3. C1/Client: Short compilation time, does not generate the best optimized code. Intended for short lived applications.
  4. C2/Server: Longer compilation time, better optimized code. Intended for long running applications.
And the compilation timeline would look something like this:
The situation with OpenJDK8 is more complex with Tiered Compilation allowing us a smooth transition from interpreter to C2 compilation via C1 compilations. 
Given that the behaviour we are after was supposed to manifest after a C1/C2 compilation, we can run the code above to find out which compiler compiled the method in question by adding the flag: -XX:+PrintCompilation:
    159  201 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)
    159  202       3       safepoint.hang.WhenWillItExitInt2::countOdds (44 bytes)
    159  203 %     4       safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)
    162  201 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes)   made not entrant
    142  200 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes)   made not entrant
   1896  202 %     4       safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes)   made not entrant
   1896  203 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)
   1896  204 %     4       safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)
   1899  203 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes)   made not entrant
The 3/4 printouts in the 4th column above indicate the method is compiled by C1 (profiled) and C2. But what is that % about? This is an indicator that the compilations are not 'normal' compilations but rather On Stack Replacement (OSR) compilations. OSR compilations are 'special', from the OpenJDK glossary:
on-stack replacement
Also known as 'OSR'. The process of converting an interpreted (or less optimized) stack frame into a compiled (or more optimized) stack frame. This happens when the interpreter discovers that a method is looping, requests the compiler to generate a special nmethod with an entry point somewhere in the loop (specifically, at a backward branch), and transfers control to that nmethod. A rough inverse to deoptimization.
I like the "rough inverse to deoptimization", it's a great mental qualification helper. OSR is the opposite of deoptimization because it is a mid-method replacement of code with compiled code, whereas a deoptimization is a mid-method replacement of code with interpreted code. Normal compilation unit replacements happen out of method, so while the method is off stack. OSR is specifically done to replace long running loops, like our countOdds method. As we can see from the printout above both C1 and C2 can produce OSR compilations.
It is impossible to disable OSR directly in a product build (there's a developer option -XX:-CICompileOSR), but we can force upfront compilation by using the following flags: "-XX:-TieredCompilation -server/client -Xcomp". Running with these arguments all 3 variations (original, refactored loop, refactored parametrized loop) now hang both for the Eclipse and the javac generated bytecode.
In both cases where the Eclipse versions hang we can see OSR compilations of the relevant code.
So it seems:
  • C1/C2 non-OSR compilations of the loop treat it as a counted loop, thus eliminating the safepoint poll and demonstrating the hung behaviour due to huge TTSP.
  • C1/C2 OSR compilations of the same code result in different results depending on the generated bytecode for the loop. I'm not going to dig further into the reasons, but I would welcome any feedback from OpenJDK compiler guys.
The OSR compilation decision to treat the loop as non-counted is quite reasonable. If we have a loop which remains on stack for very long periods we should perhaps consider it uncounted.

Killer Task

I started down this path looking for valid Java code which will result in a hung JVM due to huge TTSP. Since OSR got in the way, can we build such a task? Sure we can, we just need to warm up the loop until it is C2 compiled and then run it with a large limit, like so:
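Something like the sketch below, assuming the nested-loop countOdds shape from the earlier reconstruction; the warmup count and small limit are arbitrary choices of mine, not the original gist.
  public class KillerTask {
    public static void main(String[] args) throws InterruptedException {
      // warm the method up with a small limit so it gets a regular (non-OSR) C2 compilation
      for (int k = 0; k < 100_000; k++) {
        countOdds(64);
      }
      Thread t = new Thread(() -> System.out.println("How Odd:" + countOdds(Integer.MAX_VALUE)));
      t.setDaemon(true);
      t.start();
      Thread.sleep(5000); // the JVM should exit here, but the compiled loop has no poll to stop at
    }

    static long countOdds(int limit) { // same nested-loop shape as sketched earlier
      long l = 0;
      for (int i = 0; i < limit; i++) {
        for (int j = 0; j < limit; j++) {
          if ((j & 1) == 1) l++;
        }
      }
      return l;
    }
  }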
This fucker now hangs in both the Eclipse and javac versions. Joy!

Side note: OSR and JMH

I look at benchmarks people write quite a bit, and it is very common to see people rolling their own benchmark harnesses: they put their "code under measurement" in a counted loop, run it 100,000 times and derive their results from that. Often that means the code they are measuring is OSR compiled code, which is different (for better or worse, mostly worse) from the code they will be running normally. If you are one of these people, I beg you: cut that fucking shit out!
To a large extent JMH will cover your use case. Where it doesn't cover your use case, post a question on the jmh-dev mailing list; it might be that you've missed something. If it turns out JMH cannot cover your use case, make sure you've read the JMH samples before rolling your own harness.
I'm not going to include a before and after example, because it just makes me sad. Look at SO for some fine examples.

Counted Loops?

I keep saying 'counted loops' like it means something, so here are some examples for you to play with and contemplate:

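The original examples are in a gist; here are a few loops of my own in the same spirit. The comments describe typical HotSpot behaviour as I understand it for the JVMs of the time, and may well be wrong for other versions/vendors, which is rather the point.
  public class CountedLoops {
    static volatile boolean done;

    static long intLoop(int limit) {
      long s = 0;
      for (int i = 0; i < limit; i++) s += i;    // int induction variable, unit stride: typically counted, poll removed
      return s;
    }

    static long longLoop(long limit) {
      long s = 0;
      for (long i = 0; i < limit; i++) s += i;   // long induction variable: typically uncounted, poll kept
      return s;
    }

    static long strideLoop(int limit) {
      long s = 0;
      for (int i = 0; i < limit; i += 2) s += i; // non-unit (but constant) stride: may or may not be treated as counted
      return s;
    }

    static long flagLoop() {
      long s = 0;
      while (!done) s++;                         // condition is not an int bound at all: uncounted, poll kept
      return s;
    }
  }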
So, as you can see the definition ain't so clear and I would imagine some of the above behaviour will vary depending on JVM version/vendor/phase of the moon etc.


So, it seems my previous example was not as tested as it should have been. I apologize to anyone who has based life changing decisions on that sample. I hope this post helps in explaining away the subtleties which led to my downfall.
The above examples help demonstrate that the line between counted and uncounted loops is a tad confusing and the classification may depend on differences in bytecode, compiler mode, and so on.