Monday, 27 July 2015

JMH perfasm explained: Looking at False Sharing on Conditional Inlining

There is an edge that JMH (read the jmh resources page for other posts and related nuggets) has over other frameworks. That edge is so sharp you may well cut yourself using it, but given an infinite supply of bandages you should definitely use it :-) This edge is the ultimate profiler, the perfasm (pronounced PERF-AWESOME!, the exclamation mark is silent). I've been meaning to write about it for a while and as it just saved my ass recently...

SpscGrowableArrayQueue's False Sharing Issue

JCTools includes a specialized SPSC (single-producer/consumer) bounded queue aimed at actor systems, the SpscGrowableArrayQueue. This queue is quite neat because it is combining the compactness of a linked queue with the awesome throughput of an array backed queue. The idea is quite simple, and similar in spirit to an ArrayList:
  1. Start with a small buffer, default to 16 elements
  2. As long as the queue doesn't grow beyond that just stay small
  3. If on offer you find that the queue is full double the size of the underlying buffer
The mechanics of the queue resizing, handling intermediate queue states, and detecting a new buffer from the consumer are a bit involved and will perhaps be expanded on some other day.
Because the queue is geared at actor systems per queue footprint is important, so I reworked the memory layout of SpscArrayQueue and skimped on the padding where possible. If you have no idea what I mean by padding, and why it is done you can read the following posts:
  1. False sharing and the MESI protocol details related to cached counters: This explores the motivation of padding fields in the first place
  2. Discovering and controlling object layout(Using an early version of JOL): This show cases the JOL tool and how it should be used to discover object layout
  3. SPSC Revisited - part I: An empiricist tale: This post discusses the inlining of counters into a queue use inheritance to control field order and introduce padding. In this post I show how to pad the counter from each other and also how to pre/post pad the class and buffer.
  4. A more high level summary of false sharing is given in this post
With an overhead of roughly 128b per padding this was instrumental to reducing the high memory cost per queue. The hot counters were still padded from each other, but the class pre/post padding were removed as well as the padding of the elements array.
So, I gave up on some of the padding, but reasoned that in most cases this should not make a difference because the hottest fields were still padded from each other.

Why so slow?

Now, I expect a queue that is very similar to SpscArrayQueue, but adds some features to be slower. There's just very little you can do about this, doing more will cost you something. But given that resizing is a rather exceptional event for this queue I thought this will be a minor hit, maybe 10-20% reduction in performance (for a certain definition of performance). JCTools has some benchmarks included and I ran the QueueThroughputBackoffNone which will have a producer and consumer threads chewing on the queue as hard as they can.
Since the numbers are not very important here, I'll stick to running benchmarks on my laptop (an MBP 11,1/ Ubuntu 15.04/Zulu JDK8u45 - Zulu is a supported and tested OpenJDK build) in this post. Rest assured that I have confirmed on real hardware the same results. The results I quote below are the pollsMade figure which reflects the actual delivered throughput.
To my puzzlement I found:
SpscArrayQueue          361.223 ±  7.156  ops/us
SpscGrowableArrayQueue   64.277 ± 31.803  ops/us

Crap performance and terrible variance, bugger me sideways.
Looking at the code, I thought there must be something I was doing to upset the mighty JIT spirits. Maybe my methods were too big? My branches too unpredictable? My variable names too offensive? So I tweaked the code this way and that, looked at the inlining log (-XX:+PrintInlining) and the assembly (-XX:+PrintAssembly/-XX:CompileCommand=print,*) I got some minor improvements, but it mostly still sucked. What's wrong? A quick look at "-prof perfasm" and a fair amount of head scratching lead to the answer. The code (before and after) is on github, the main focus for this post is perfasm and its usage so I won't dive into it.

Before we start: How does perfasm work?

To use perfasm you'll need a Linux(or Windows) OS, running on real hardware, the relevant perf tool installed, and a JVM setup to print assembly.
Perf is (amongst other things) an instruction level profiler, of the kind that don't usually work with Java traditionally (though things are slowly changing). One of the features offered by perf is the "perf record/annotate" workflow. Perf interrupts your process repeatedly and records the current PC (program counter) at each sample. This sampling of the program counter is recorded over a period of time to be post processed by the annotate feature which correlates samples to methods, lines of code and assembly instructions. The challenge for perf when dealing with Java code is that the binary form of each method only exists for the duration of that process lifetime. This means the PC is mostly referring to methods that are nowhere to be found when the annotation stage comes along.
To summarize: perf record works, but perf annotate is broken for Java.
To make perfasm work JMH captures the JVM compiler outputs by enabling the following flags:
  -XX:+UnlockDiagnosticVMOptions
  -XX:+LogCompilation
  -XX:LogFile=...
  -XX:+PrintAssembly
  -XX:+PrintInterpreter
  -XX:+PrintNMethods
  -XX:+PrintNativeNMethods
  -XX:+PrintSignatureHandlers
  -XX:+PrintAdapterHandlers;
  -XX:+PrintStubCode
  -XX:+PrintCompilation
  -XX:+PrintInlining
  -XX:+TraceClassLoading
  -XX:PrintAssemblyOptions=syntax
The data collected here allows perfasm to do the annotation process by itself. I'll not bore you with the details of the output processing, but hat off to Mr. Shipilev who ploughed through the gory mess and made this work. The end result is a detailed output offering you the assembly of the hottest regions in your program, along with a list of the top hottest methods including both native and Java parts of the stack.
Because perf allows the recording of any number of events the definition of hottest depends on the events you choose to profile. The default events are cycles and instructions, but you can specify any number of events (E.g. -Djmh.perfasm.events=cycles,cache-misses). The first event specified will be the 'hotness' qualifier.

What do you get?

The perfasm output is split into 4 sections:
  1. Annotated assembly output for the top HOT regions(titled "Hottest code regions (>10.00% "cycles" events):")
  2. A list of the hot regions in your benchmark (titled "[Hottest Regions]"). This is the list of compiled methods inclusive of native methods, which is why we have the next section.
  3. A list of the hottest methods after inlining, so only java methods in this section (titled "[Hottest Methods (after inlining)]").
  4. A distribution of the cycles between types of regions (titled "[Distribution by Area]"). This will inform you of how the split goes between compiled code, kernel, JVM etc.
The most interesting of the 4 sections (and the only one I'm going to explain here) is the annotated hot regions section. I tend to edit the output of assembly spouting tools to make it more readable, but for the sake of explaining what's what here's the original output with footnotes:
To focus on the main features I trimmed out allot of the assembly code, which is where you see the "<MORE OF SAME/>" comments. Note the legend at the start of the output describing the 2 left most columns: cycles and instructions sample percentages.
Here's the footnotes explained:
  1. (Line 7) This column is the instruction address. This is important as this column is both how perfasm matches the perf record output with the PrintAssembly output and how poor sods like yours truely have to look for jump destinations in the code.
  2. (Line 7) This column has the actual instructions, in AT&T syntax (src -> dst).
  3. (Line 7) This column is the instruction annotation generated by PrintAssembly. It is not always correct, but it's helpful in connecting the dots. Multi-level inlining of methods can be seen in action everywhere.
  4. (Line 12) The condition/loop edge annotation is a wonderful recent addition to perfasm. It makes life marginally more bearable for the assembly consumer by connecting the jumps with their destinations. This is the start point annotation ""
  5. (Line 16) This is the end point annotation ""
When looking at this you will be searching for suspect hot spots in the code. Some instruction which is chewing away your cycles. If you're in luck you'll be able to tweak your code just right to side step it and then you are off and away :-)

Back on Track: Show me the cycles!

Right, so I point Captain Awesome at my benchmark and what do I see (trimmed addresses and edges as there's no important conditions here, also trimmed package names and moved code annotation left):
This instruction, eating 30% of my cycles (line 7, needs HEALING!), is part of a guard generated to ensure that inlining the poll method call is still a valid decision, that the instance observed at the callsite is of the class for which we inlined the poll method. This is not really a Java code line, this is the aftermath of inlining a virtual call (through an interface, see Aleksey's method dispatch inlining article to learn more on inlining).
To be fair, the above is kind of hard to understand. Why is a comparison between a constant and a register eating 30% of the cycles? according to every instruction 'cost' manual this should take 1 cycle (and the CPU can do 3-4 of them in parallel too), this is obviously not the instruction I'm looking for.
This is a phenomena known as 'skid' where the reported instruction at a given sample is inaccurate because modern CPUs are complicated. See the following dialogue on the linux-perf-user mailing list:
> I think Andi mentioned this to me last year -- that instruction profiling was no longer reliable. 
It never was. 
> Is this due to parallel and out-of-order execution? (ie, we're sampling the instruction pointer, but that's set to the resumption instruction, not the instructions being processed in the backend?). 
Most problems are due to 'skid': It takes some time to trigger the profiling interrupt after the event fired. [...] There are also other problems, for example an event may not be tied to an instruction. Some events have inherently large skid.
This sounds grim, but at the end of the day this is as accurate as profiling can get. It's not perfect, but it's still very valuable and you get used to it (there are ways to minimize skid discussed in the link, I've not tried those).
What you end up doing is looking for instructions just before the blamed instruction, or instructions on which the blamed instruction is dependant, which may be more reasonably blamed for the bottleneck. Given the CMP is not the problem, we must ask why would the CPU spend so much time at it? A CMP or a TEST will often get blamed for the price of the load into the registers they use, in this case the CMP is most probably being blamed for the load of the queue type from the object header one instruction back:
"0x8(%r12,%r11,8),%r8d  ; implicit exception: dispatches to 0x00007f83c93f929d"
It doesn't help that the comment talks about some implicit exception (this is an implicit null check), where it could say "getkid(q) + implicit_nullchk(q)" or something similar to indicate we are loading the kid (klass id) from the object header (see object header layout details here).
Now that I had pointed a finger at this operation, I was still quite confused. This was not an issue for any of the other queues, why would loading the kid be such a bottleneck in this case? Maybe I'm wrong about this (always a good assumption)? To prove this is the issue I created a duplicate of the original benchmark where I use the SpscGrowableArrayQueue directly instead of going via the interface, for comparison I also benchmarked the SpscArrayQueue in the same fashion:
SpscArrayQueue          369.544 ±  1.535  ops/us
SpscGrowableArrayQueue  272.021 ± 12.133  ops/us

Now that's more like it! the expected 20% difference is more like 30%, but this is much closer. This still begs the question, why is the type check for SpscGrowableArrayQueue so expensive? We can see that for SpscArrayQueue this makes very little difference, how is SpscGrowableArrayQueue different?

Messing with the KID

I had a long hard look at this issue, which didn't help, then slept on it, which did help, and realized the problem here is that the object header is false-sharing with the producer fields. When I trimmed down the padding on this class in an effort to minimize allocation, I removed the array post and pre-padding as well as the class pre/post padding and reasoned that for the most part I need not worry about the object's neighbours false sharing. What I failed to realize was that the consumer and producer threads might be frequently hitting the object header, in this benchmark on every call. Once I realized this was the issue I reinstated the pre-padding such that the producer index is far enough from the object header to stop interfering with it and the problem went away (see before and after, it's a messy diff as I fixed some other niggles while I was there, you can take the last version and play with adding/removing the padding and reordering the index and cold fields to verify my claims).
Here's the (slightly trimmed for readability) perfasm output for the same conditional inlining check in the padded version:

As an interesting side note, this issue was not visible in the hand rolled throughput benchmarks. This is because in those benchmarks the queue is hoisted into a variable before the consume/produce loop which means the conditional inlining check can be done out of loop as well. This is great for getting nice numbers but hides an issue which users are likely to hit. Credit goes to JMH for herding benchmarks down a path which forces these issues into the measured scope.

Summary

The main focus of this post is perfasm, I hope it helps get you started on using it. The broader context in which you would use this tool is a less explicit background. I use perfasm regularly, but I'm also reasonably happy to look at assembly and compiler annotation which I know most people are not. I find it to be invaluable in issues, like the one described in this post, where the cost is between the Java lines rather than the lines themselves. Any profiler can give you a broad feel of where performance bottlenecks are, and a non-safepoint-biased profiler can show you the hottest line of code. What a Java profiler will not tell you is about the generated JVM code between your lines. It will attribute the cost of those instructions to the nearest Java line, and that can become a very confusing chase.
It also worth pointing out that any nano-benchmark (measuring operations in the 0-500 nanoseconds range) practically requires you to look at the assembly for it's analysis. But 500ns can still be an aweful lot of code and an assembly level profiler is very handy. At this point it worth mentioning the Oracle Solaris Studio (first prize for least known profiler every year since release). It is a great assembly level profiler, and just generally a great profiler. If your measurement needs to take place outside the cosy comforts of a JMH benchmark I would recommend you give it a spin.
Finally, this investigation came in the context of a workflow that is followed in the development of JCTools. I would loosely describe it as follows:
  1. Implement new feature/data structure
  2. Reason about expected impact/performance as part of design and implementation
  3. Test expectations using the existing set of benchmarks
  4. Expectations are far off the mark (if not... not sure, will figure it out when it happens)
  5. Dig in until either expectations or code are cured.
This has a whiff of scientific exploration to it, but I assure you it is not done quite so seriously and I often fail to follow my own advice (or worse other people's advice). The habit of testing performance assumptions/expectation has offered me many an afternoon spent banging my head on a variety of surfaces. Perfasm has been instrumental in reducing the amount of head banging, but I fear nothing short of permanent brain damage will actually solve the problem.
This post has been kindly reviewed by Aleksey Shipilev, Darach Ennis and Richard Warburton. Any remaining errors are entirely their fault ;-)

Friday, 22 May 2015

Object.equals, primitive '==', and Arrays.equals ain't equal

It is a fact well known to those who know it well that "==" != "equals()" the example usually going something like:
  String a = "Tom";
  String b = new String(a);
  -> a != b but a.equals(b)

It also seems reasonable therefore that:
  String[] arr1 = {a};
  String[] arr2 = {b};
  -> arr1 != arr2 but Arrays.equals(arr1, arr2)

So far, so happy... That's examples for you...
For primitives we don't have the equals method, but we can try boxing:
  int i = 0;
  int j = 0;
  -> i == j and also ((Integer)i).equals((Integer)j), and Arrays.equals({i}, {j})


Floats ruin everything

But... some primitives are more equal then others. Consider for instance the following example:
  float f1 = Float.Nan;
  float f2 = Float.Nan;
  -> f1 != f2, and also f1 != f1, Nan's are nasty

This is what floats are like children, they be treacherous little hobbitses and no mistake. The fun starts when you box them fuckers:
  -> ((Float)f1).equals((Float)f2), because Java likes to fix shit up
  -> Arrays.equals({((Float)f1)},{((Float)f2)}), hmm consistent
  -> but also: Arrays.equals({f1},{f2})...

This is counter to how one would normally think arrays are compared. You would think that for primitives (skipping arrays null and length checks):
  boolean feq = true;
  for(int i=0;i<farr1.length;i++){
    if(farr1[i] != farr2[i]) {
      feq = false; break;
    }
  }
Is the same as:
  boolean feq = Arrays.equals(farr1,farr2);
But for double[] and float[] the contract has been changed to accommodate Nan. This is perhaps a good decision on the JDK authors side, but it is somewhat surprising.
Conclusion: 2 arrays can be equal, even if some elements are not equal.

Objects are weird

Let's consider objects for a second. We started with a != b does not predicate that !a.equals(b), but what about a == b?
The Object.equals() javadoc offers the following wisdom:
The equals method implements an equivalence relation on non-null object references:

  • It is reflexive: for any non-null reference value x, x.equals(x) should return true.
  • It is symmetric: for any non-null reference values x and y, x.equals(y) should return true if and only if y.equals(x) returns true.
  • It is transitive: for any non-null reference values x, y, and z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) should return true.
  • It is consistent: for any non-null reference values x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided no information used in equals comparisons on the objects is modified.
  • For any non-null reference value x, x.equals(null) should return false.
You know what the most important word in the above lines is? SHOULD
In the wonderful land of Should, all classes behave and if they bother implementing an equals method they follow the above rules (and also override the hashcode method while they are there).
But what happens when an object is implement perversely, like this:
  public class Null {
    @Override
    public boolean equals(Object obj) {
        return obj == null;
    }
  }

This has some nice implications:
  Null n = new Null();
  -> n != null, but n.equals(null)
  Object[] a = {n};
  Object[] b = {null};
  Object[] c = {n};
  -> Arrays.equals(a, b) is true, because n.equals(null)
  -> Arrays.equals(b, a) is false, because null != n
  -> Arrays.equals(a, a) is true, because a == a
  -> Arrays.equals(a, c) is false, because !n.equals(n)

Or to quote The Dude: "Fuck it man, let's go bowling"


Tuesday, 19 May 2015

Degrees Of (Lock/Wait) Freedom

Yo, Check the diagonal, three brothers gone...
I've been throwing around the terms lock-free and wait-free in the context of the queues I've been writing, perhaps too casually. The definition I was using was the one from D. Vyukov's website (direct quote below):
  • Wait-freedom:  Wait-freedom means that each thread moves forward regardless of external factors like contention from other threads, other thread blocking. Each operations is executed in a bounded number of steps. 
  • Lock-freedom: Lock-freedom means that a system as a whole moves forward regardless of anything. Forward progress for each individual thread is not guaranteed (that is, individual threads can starve). It's a weaker guarantee than wait-freedom. [...] A blocked/interrupted/terminated thread can not prevent forward progress of other threads. Consequently, the system as a whole undoubtedly makes forward progress.
  • Obstruction-freedom: Obstruction-freedom guarantee means that a thread makes forward progress only if it does not encounter contention from other threads. That is, two threads can prevent each other's progress and lead to a livelock. 
  • Blocking Algorithms: It's the weakest guarantee - basically all bets are off, the system as a whole may not make any forward progress. A blocked/interrupted/terminated thread may prevent system-wide forward progress infinitely. 
The above definitions refer to "forward progress/moving forward" in the context of "system" and "thread". In particular they translate "X-freedom" to "a guarantee that T/S makes forward progress".
Now "thread" is a well defined term, but what counts as forward progress of a given thread?
To my mind a thread which is free to return from a method is free to 'make progress'. For example: if the next element in a queue is not visible to a consumer thread I can return control to the consumer thread which is then free to make progress on anything it feels like and try getting an element out of the queue later. Similarly a producer thread which is unable to add elements to a queue is 'free' to 'make progress' even if the queue is full. In short, I interpreted 'freedom' as regaining control of execution.

Freedom? yeah, right...

My interpretation however is limited to the scope of the thread and assumes it has other things to do (so placing the thread within the context of a given 'system'). So the freedom to make progress is really a freedom to make progress on 'other things'. The definitions above when applied to a given data structure have no knowledge of the 'system' and it is therefore fair to assume nothing. And so if the 'system' is viewed as being concerned only with using the data structure, it seems my view of 'regained control freedom' is not inline with the 'progress making freedom'.
Let's consider for example the linked Multi-Producer-Single-Consumer queue discussed in the previous post (this is the not the j.u.Queue compliant version):

Now let us assume a producer has stalled in that unpleasant gap between setting the head and pointing prev to the new head(at line 34), this is what I've come to think of as a 'bubble' in the queue. The next producer will see the new head, but the consumer will not see it (or any nodes after it) until prev.next is set.
Others producers can 'make progress' in the sense that elements will get added to the queue. Those elements visibility to the consumer however is blocked by the 'bubble'. What is the correct degree of freedom for these producers? what about the consumer?
When describing the queue Vyukov makes the following statements:
  • Wait-free and fast producers. One XCHG is maximum what one can get with multi-producer non-distributed queue.
  • Push [offer in the Java version] function is blocking wrt consumer. I.e. if producer blocked in (* [line 34]), then consumer is blocked too. Fortunately 'window of inconsistency' is extremely small - producer must be blocked exactly in (* [line 34]).
So producers are wait-free, but consumer is blocking. The consumer is truly blocked by the 'bubble' and can make no progress in the terms of the data structure. Despite other producers adding nodes the consumer cannot see those nodes and is prevented from consuming the data. The fact that control is returned to the caller is considered irrelevant. And if we are being precise we should call this a blocking queue.

Life imitates Art

The "Art of Multiprocessor Programming" offers a different definition:
"A concurrent object implementation is wait free if each method call completes in a finite number of steps. A method is lock-free if it guarantees that infinitely often some method call finishes in a finite number of steps" - page 99
This sentiment is similarly echoed in "Wait-Free Queues With Multiple Enqueuers and Dequeuers" a paper on lock free queues:

  • [...] to ensure that a process (or a thread) completes its operations in a bounded number of steps, regardless of what other processes (or threads) are doing. This property is known in the literature as (bounded) wait-freedom.
  • lock-freedom ensures that among all processes accessing a queue, at least one will succeed to finish its operation.

Hmmm... no system, no progress. The boundaries are now an object and a method, which are well defined. I think this definition matches my understanding of 'regained control freedom' (open to feedback). If a method returns, progress or no progress, the condition is fulfilled. Under this definition the queue above is wait-free.
The wiki definition is again closer to the one outlined by Vyukov:
  • An algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread for some operations. A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress.
  • Wait-freedom is the strongest non-blocking guarantee of progress, combining guaranteed system-wide throughput with starvation-freedom. An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes.
  • Lock-freedom allows individual threads to starve but guarantees system-wide throughput. An algorithm is lock-free if it satisfies that when the program threads are run sufficiently long at least one of the threads makes progress (for some sensible definition of progress). All wait-free algorithms are lock-free.
We are back to a fuzzy definition of system, progress, but the underlined sentence above is interesting in highlighting suspension. In particular I'm not sure how this is to be interpreted in the the context of the lock-free definition where suspension is not an issue if some threads can keep going.
Note that once we have fixed poll to behave in a j.u.Queue compliant way: The poll method is no longer wait-free by any definition.

What about the JCTools queues?

JCTools covers the full range of MPMC/MPSC/SPMC/SPSC range of queues, and aims for high performance rather than strict adherence to any of the definitions above. To be more precise in the definitions I would say:
  • SpscArrayQueue/SpscLinkedQueue are Wait Free (on both control and progress senses)
  • MpscArrayQueue is lock free on the producer side and blocking on the consumer side. I'm planning to add a weakPoll method which will be wait free in the control sense.
  • MpscLinkedQueue is wait free on the producer side and blocking on the consumer side. I'm planning to add a weakPoll method which will be wait free in the control sense.
  • SpmcArrayQueue is lock free on the consumer side and blocking on the producer side. I'm planning to add a weakOffer method which will be wait free in the control sense.
  • MpmcArrayQueue is blocking on the both producer and consumer side. I'm planning to add a weakOffer/Poll method which will be lock free in the control sense.
Does it matter that the queues do not meet the exact definitions set out below? As always it depends on your needs...

Non-Blocking Algorithms On The JVM?

{Full disclosure: please note when reading the next section that I work with Azul Systems on the Zing JVM, I'm not trying to sell anything, but I am naturally more familiar with it and consider it awesome :-)}
What happens when we include the JVM in our definition of a system? Can you build a non-blocking algorithm on the JVM? All the JVMs I know of are blocking in some particular cases, important to the discussion above are:
  1. Allocation: The common case for allocation is fast and involves no locking, but once the young generation is exhausted a collection is required. Young generation collections are stop the world events for Oracle/OpenJDK. Zing has a concurrent young generation collector, but under extreme conditions or bad configuration it may degrade to blocking the allocator. The bottom line is that allocation on the JVM is blocking and that means you cannot consider an algorithm which allocates non-blocking. To be non-blocking a system would have to provably remove the risk of a blocking collection event on allocation.
  2. Deoptimization: Imagine the JIT compiler in it's eager and aggressive compilation strategy has decided your offer method ends up getting compiled a particular way which ends up not working out (assumptions on inheritance, class loading, passed in values play a part). When the assumption breaks a deoptimization may take place, and that is a blocking event. It is hard to prove any piece of code is deoptimization risk free, and therefore it is hard to prove any Java code is non-blocking. Deoptimization is in many cases a warmup issue. Zing is actively battling this issue with ReadyNow which reloads previously recorded compilation profiles for the JVM, greatly reducing the risk of deoptimization. Oracle/OpenJDK users can do a warmup run of their application before actually using it to reduce the risk of deoptimization. My colleague Doug Hawkins gave a great talk on this topic which will give you far more detail then I want to go into here :-). 
  3. Safepoints: If your algorithm includes a safepoint poll it can be brought to a halt by the JVM. The JVM may bring all threads to a safepoint for any number of reasons. Your method, unless inlined, already has a safepoint on method exit(OpenJDK) or entry(Zing). Any non-counted loop will have a safepoint-poll, and this includes your typical CAS loop. Can you prevent safepoints from ever happening in your system? It might be possible for a JVM to provide users with compiler directives which will prevent safepoint polls from being inserted in certain code paths, but I don't know of any JVM offering this feature at the moment.
In short, if you are on the JVM you are probably already blocking. Any discussion of non-blocking algorithms on the JVM is probably ignoring the JVM as part of the 'system'. This needs to be considered if you are going to go all religious about wait-free vs. lock-free vs. blocking. If a system MUST be non-blocking, it should probably not be on the JVM.

Summary

Any term is only as good as our ability to communicate with others using it. In that sense, splitting hairs about semantics is not very useful in my opinion. It is important to realize people may mean any number of things when talking about lock-free/wait-free algorithms and it is probably a good idea to check if you are all on the same page.
I personally find the distinction between control/progress freedom useful in thinking about algorithms, and I find most people mean lock/wait-free excluding the effects of the JVM...
Thanks Martin & Doug for the review, feedback and discussion, any remaining errors are my own (but their fault ;-) )

Wednesday, 22 April 2015

Porting Pitfalls: Turning D.Vyukov MPSC Wait-free queue into a j.u.Queue

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
D. Vyukov is an awesome lock-free dude, and I often refer to his instructive and invaluable website 1024cores.net in my posts. On his site he covers lock free queue implementations and in particular a wait-free MPSC linked node queue. This is really rather special when you consider that normally MP would imply lock-free rather than wait-free guarantees. I've ported his algorithm to Java (and so have many others: Netty/Akka/RxJava etc.), and had to tweak it to match the Queue interface. In this post I'd like to explain the algorithm, it's translation to Java, and the implications of making it a j.u.Queue.

Lock free vs. Wait free

Let's review the definitions:
  • Wait-free: thread progress is guaranteed, all operations finish in a determined number of steps.
  • Lock-free: global progress is guaranteed, though a particular thread may be blocked.

An example of a transition from lock free to wait free is available with JDK8 changes to AtomicReference::getAndSet(). The change was made by utilizing the newly available Unsafe::getAndSetObject intrinsic which translates directly to XCHG (on x86). So where we used to have for AtomicReference:
T getAndSet(T newVal) {
  T currentValue;
  do {
  currentValue = val; // val is a volatile field
  } while (!Unsafe.compareAndSwapObject(this, VAL_FIELD_OFFSET, currentValue, newValue));
  return currentValue;
}
Now we have:
T getAndSet(T newVal) {
  return Unsafe.getAndSetObject(this, VAL_FIELD_OFFSET, newValue);
}
I discussed a similar change to AtomicLong.getAndAdd in a previous post, replacing the CAS loop with LOCK XADD.

The Vyukov Wait Free MPSC queue

This is a LinkedList type structure and the interesting methods are offer and poll, here's the original (I did some formatting):
Awesome in it's simplicity, deceptive in it's genius. Be aware that head/tail meaning is the other way around than what most people are used to. I personally go for producer/consumerNode (head = producer side, tail = consumer side in the above snippet) in my code, but for consistency I'll stick with Mr. Vs notation for the porting exercise.
But how do we manage the same memory barriers in Java? We can be nice about it and use the AtomicFieldUpdater or more brutal and use Unsafe. I find you get better performance with Unsafe, but you should consider how appropriate this is for you. In any case, here's what we end up with:
The code is pretty similar. Now if we wanted to complete the j.u.Queue we could extend AbstractQueue and implement only size()/peek() and we're done (I'm not going to bother with the iterator()):
There we go, seems reasonable don't it? off to the pub to celebrate some kick ass lock free programming!

Vyukov::poll() ain't Queue::poll!

I've discussed this particular annoyance in a previous post which has some overlap with this one. The names are the same, but as it turns out the guarantees are quite different. While Vyukov is doing a fine job implementing a queue, not any queue is a j.u.Queue. In particular for this case, the poll() method has a different contract:
  • j.u.Queue: Retrieves and removes the head of this queue, or returns null if this queue is empty.
  • Vyukov: Retrieves and removes the head of this queue, or returns null if next element is not available.
Why wouldn't the next element be available? Doesn't that mean the queue is empty? Sadly this ain't the case. Imagine for instance there are 2 producers and we run both threads step by step in a debugger. We break at line 33 (4th line of offer, the Java version):
  1. Producer 1: we step 1 line, we just replaced head with node n. We are suspended before executing line 35.
  2. Producer 2: we let the program continue. We've replaced head and linked it to the previous head.
What state are we in? Let's look at head/tail nodes and where they lead (I number the nodes in order of assignment to head):
  • head = Node[2], this is the node created by Producer 2. We also know that Node[1].next = Node[2], because we let offer run it's course on Producer 2.
  • tail = Node[0] the node we allocated in the constructor. This is the node head was before the first producer came along. This is what prev is equal to for Producer 1, but because we suspended that thread it never set it's next value. Node[0].next is still null!

If a consumer came along now they would get a null from poll(), indicating the queue is empty. But the queue is obviously not empty!
So it seems we cannot deduce the queue is empty from looking at tail.next here's 2 valid indicators that the queue is empty:
  • head == tail : this is the starting point set in the constructor and where the consumer ends up after consuming the last element
  • head.val == null : head can only have a value of null if it is tail
Here's a solution to a correct poll() and the knock on effect on peek():
This is a bit annoying because now we have a wait-free offer(), but poll() and peek() are lock-free(only block one thread, producers can make progress).
This pitfall is tricky enough that not only did I fall for it (on another queue algorithm, but same mistake), it also took by surprise Mr. Manes writing an interesting variant on this queue for his high performance cache implementation (I filed an issue which he promptly fixed), and struck the notorious Dave Dice when considering Vyukov's MPMC queue (see comments for discussion).
So is this it? Are we done?
Almost... size() is still broken.

A Good Size

It's not really surprising that size is broken given the terminating condition for the size loop was relying on next == null to terminate the count. Size was also broken in 2 other subtle ways:
  • The interface for size dictates that it returns a positive int. But given that the queue is unbounded it is possible (though very unlikely) for it to have more than 2^31 elements. This would require that the linked list consume over 64GB (16b + 8b + 8b=32b per element, refs are 8b since this requires more than 32GB heap so no compressed oops). Unlikely, but not impossible. This edge condition is handled for ConcurrentLinkedQueue (same as here), and for LinkedBlockingQueue (by bounding it's size to MAX_INT, so it's not really unbounded after all) but not for LinkedList (size is maintained in an int and is blindly incremented).
  • Size is chasing a moving target and as such can in theory never terminate as producers keep adding nodes. We should limit the scope of the measurement to the size of the queue at the time of the call. This is not done for CLQ and should perhaps be considered.
Here's the end result:
If you need a full SPSC/MPSC linked queue implementation they are available in JCTools for your pleasure, enjoy!

Special thanks to Doug and Ben for reviewing!

Monday, 13 April 2015

On Arrays.fill, Intrinsics, SuperWord and SIMD instructions

{This post turned rather long, if you get lazy feel free to skip to the summary}
Let's start at the very beginning, a very good place to start... My very first post on this blog was a short rant on intrinsics, and how they ain't what they seem. In that post I made the following statement:
"intrinsic functions show up as normal methods or native methods"
Which is correct. An intrinsic function is applied as a method substitution. A method call will appear in the code and the compiler will replace it's apparent source-level implementation with a pre-cooked implementation. In some cases intrinsics are sort of compilation cheats, the idea being that some bits of functionality are both very important (i.e. worth while optimizing) and can benefit from a hand crafted solution that will be better than what the compiler can achieve. The end result can be in one of a few flavours:
  1. Method call replaced with a call to a JVM runtime method: E.g. System.arrayCopy is replaced with a call to a method stub generated by the runtime for all array types. This method call is not a JNI call, but it is a static method call that is not inlined.
  2. Method call replaced with one or more instructions inlined: E.g. Unsafe.getByte/compareAndSet/Math.max
  3. Method call replaced with compiler IR implementation: E.g. java.lang.reflect.Array.getLength
  4. A mix of the above: E.g. String.equals is partially implemented in IR, but the array comparison is a call to a method stub.
The intrinsics are all set up in vmSymbols.hpp and if you look, you'll see Arrays.fill is NOT on the list. So why am I talking about Chewbacca? Because it is something like an intrinsic...

The Arrays.fill SIMD Opportunity

Arrays.fill is the Java memset (fills an array with a given value), and just like System.arrayCopy (memcpy in C lingo) is worth the effort to optimize and offers the same kind of opportunity. What opportunity might that be, you ask? the opportunity to use SIMD (Single Instruction Multiple Data) instructions when the underlying CPU offers them (I assume for the sake of discussion AVX enabled CPUs i.e. since Sandy Bridge, I find this listing of intel intrinsics useful to explain and sort through the available instructions). These instructions allow the CPU to operate on up to 256 bit (512 bit soon) chunks of data, thus transforming 32 byte sized MOV instructions into a single wide MOV instruction (E.g. the intel C instrinsic  _mm256_storeu_si256 or the corresponding instruction vmovdqu). SIMD instructions are good for all sorts of operations on vectors of data, or arrays, which is why the process of transforming element by element operations into SIMD instructions is also referred to as vectorization.
The actual assembly stub is generated dependent on CPU and available instruction set. For x86 the code is generated by the macroAssembler_x86.cpp, and the observant digger into the code will find it makes use of the widest memory instructions it can identify the processor is capable of. Wider is better baby! If you are not morbidly curious about what the implementation looks like, skip the next wall of assembly and you'll be back in Java land shortly.
Here's what the assembly boils down to when UseAVX>=2/UseSSE>=2/UseUnalignedLoadStores=true:

Roughly speaking the algorithm above is:
  1. Fill up an XMM register with the intended value
  2. Use the XMM register to write 64 byte chunks (2 vmovdqu) until no more are available
  3. Write leftover 32 byte chunk (skipped if no matching leftovers)
  4. Write leftover 8 byte chunks (skipped if no matching leftovers)
  5. Write leftover 4 bytes (skipped if no matching leftovers)
  6. Write leftover 2 bytes (skipped if no matching leftovers)
  7. Write leftover 1 bytes (skipped if no matching leftovers)
It ain't nice, but we do what we gotta for performance! There are variations of the above described across the internets as the done thing for a memset implementation, this might seem complex but is pretty standard... anyway, moving right along.

The Arrays.fill 'intrinsic'

Arrays.fill is different from System.arrayCopy because, as it's absence from vmSymbols suggests, it's not a method substitution kind of intrinsic (so technically not an intrinsic). What is it then? Arrays.fill is a code pattern substitution kind of compiler shortcut, basically looking for this kind of loop:
And replacing it with a call into the JVM memset implementation (I recently learnt the same thing is done by GCC as well, see code to assembly here). The pattern matching bit is done in loopTransform.cpp. This feels enough like an intrinsic grey area that the method doing the pattern match and replace is called intrinsify_fill.
Pattern matching makes this optimization potentially far more powerful than method substitution as the programmer doesn't have to use a special JDK method to convey meaning, they can just express their meaning in code and the compiler 'knows' that this simple loop means 'fill'. Compare that with System.arrayCopy where rolling your own leads to performance that is much worse than that offered by the intrinsic.
Let's prove me right (my favourite thing, beats kittens and all that crap), here's a JMH (see the JMH reference page for more JMH info/examples) benchmark comparing Arrays.fill to a hand rolled fill, and System.arrayCopy to handrolled array copy:
And the results are (Oracle JDK8u40/i7-4770@3.40GHz/Ubuntu, array is 32K in size)?
ArrayFill.fillBytes                561.540 ± 10.814 ns/op
ArrayFill.manualFillBytes          557.901 ± 5.255  ns/op
ArrayFill.manualReversedFillBytes 1017.856 ± 0.425  ns/op
ArrayFill.copyBytes               1300.313 ± 13.482 ns/op
ArrayFill.manualCopyBytes         1477.442 ± 13.030 ns/op

We can verify that the call out to the JVM fill method happens for fillBytes/manualFillBytes by printing out the assembly:

So what have we learnt so far:
  • Use System.arrayCopy, it is better than your handrolled loop. But surprisingly not hugely better, hmmm.
  • You don't have to use Arrays.fill, you can roll your own and it works the same. Notice the call out to the fill method. But...
  • Don't get too creative rolling your own. If you get too funky (like filling the array backwards) it'll fall apart and the 'intrinsic' won't happen. But do note that the reverse fill still has some of that good SIMD stuff going, we'll get to that in a sec.

Are The Other Types Filling The Love?

It all sounds great don't it? Let's see how this pans out for other types. We'll be filling an array of 32KB. To be uniform across data types that means a 16K chars/shorts array, an 8K ints/floats array and a 4K array of longs. I added an 8K array of objects, which is the same size for compressed oops on the Oracle JVM (reference size is 4 bytes, same as an int).
The JMH benchmark code is as you'd expect:
Here's some reasonable expectations:
  • If no optimizations are present, wider writes are more efficient. It follows that the longFill would be the fastest. But...
  • Given a clever compiler the fill loop is replaced with the widest writes possible, so there should be no significant difference. But the fill optimization does not cover double/long/object arrays, so we might expect longFill to be the worst performer.
  • An objects array is not that different from an int array, so performance should be similar. Sure there's a write barrier, but it need only be done once per card (not once for the whole array as I thought initially, god bless Shipilev and PrintAssembly), so that's an extra byte write per card of elements filled. A card is per 512 bytes, each element is 4 bytes, so that's one card per 128 elements. Given there is no fill method implemented for it we may expect it to be slightly worse than the longFill.
  • We should not rely on expectations, because performance is best measured.
As you'd expect the results are somewhat different than the expectations (Oracle JDK8u40/i7-4770@3.40GHz/Ubuntu):
ArrayFill.fillBytes     561.540 ± 10.814  ns/op
ArrayFill.fillChars     541.901 ±  4.833  ns/op
ArrayFill.fillShorts    532.936 ±  4.508  ns/op
ArrayFill.fillInts      543.165 ±  3.788  ns/op
ArrayFill.fillFloats    537.599 ±  2.323  ns/op
ArrayFill.fillLongs     326.770 ±  3.267  ns/op
ArrayFill.fillDoubles   346.840 ±  5.786  ns/op
ArrayFill.fillObjects  4388.242 ± 11.945  ns/op

Say WOT?
For bytes/chars/shorts/ints/floats Arrays.fill performs very similarly. This much is as expected from the second point above. But filling an array of longs/doubles is better than the others. The funny thing is, there's no fill function implemented for the long array, how come it is so darn quick? Also, why does the objects fill suck quite so badly when compared with the rest (I will not be addressing this last question! I refuse! this post is too fucking long as it is!)?
This is what happens when we turn off the OptimizeFill flag:
ArrayFill.fillBytes    1013.803 ± 0.227  ns/op
ArrayFill.fillChars     323.806 ± 3.879  ns/op
ArrayFill.fillShorts    323.689 ± 4.499  ns/op
ArrayFill.fillInts      326.336 ± 1.559  ns/op
ArrayFill.fillFloats    319.449 ± 2.048  ns/op
ArrayFill.fillLongs     328.692 ± 3.282  ns/op
ArrayFill.fillDoubles   345.035 ± 6.362  ns/op
ArrayFill.fillObjects  4397.130 ± 7.161  ns/op

Strange innit? now we got char/int/long arrays all performing similarly. In fact, with the exception of the byte array, everything is better than it was with the optimization.


Superword to the rescue! 

Turns out the JIT compiler is clued up on the topic of SIMD parallelisation by way of Superword Level Parallelism (see the original paper here):
In some respects, superword level parallelism is a restricted form of ILP (Instruction Level Parallelism). ILP techniques have been very successful in the general purpose computing arena, partly because of their ability to find parallelism within basic blocks. In the same way that loop unrolling translates loop level parallelism into ILP, vector parallelism can be transformed into SLP. This realization allows for the parallelization of vectorizable loops using the same basic block analysis. As a result, our algorithm does not require any of the complicated loop transformations typically associated with vectorization. In fact, vector parallelism alone can be uncovered using a simplified version of the SLP compiler algorithm.
...
Superword level parallelism is defined as short SIMD parallelism in which the source and result operands of a SIMD operation are packed in a storage location.
...
Vector parallelism is a subset of superword level parallelism.
The Hotspot compiler implements SLP optimizations in superword.cpp and you are invited to dive into the implementation if you like. I'm going to focus on it's impact here, and to do that I only need to know how to turn it on and off (core competency for any software person). It's on by default, so above results are what happens when it is on, here's what life looks like when it is off too (so -XX:-OptimizeFill -XX:-UseSuperWord):
ArrayFill.fillBytes   8501.270 ±  2.896  ns/op
ArrayFill.fillChars   4286.161 ±  4.935  ns/op
ArrayFill.fillShorts  4286.168 ±  3.146  ns/op
ArrayFill.fillInts    2152.489 ±  2.653  ns/op
ArrayFill.fillFloats  2140.649 ±  2.587  ns/op
ArrayFill.fillLongs   1105.926 ±  2.228  ns/op
ArrayFill.fillDoubles 1105.820 ±  2.393  ns/op
ArrayFill.fillObjects 4392.506 ± 11.678  ns/op


Life is revealed in all it's sucky splendour! This is what happens when the compiler shows you no love... did I say no love? hang on, things can get a bit worse.

Detour: Unsafe? We don't serve you kind here

To all the Unsafe fans, I got some sad news for y'all. Unsafe 'fill' loops are not well loved by the compiler. This is the price of stepping off the beaten path I guess. Consider the following benchmark:
The results are:
ArrayFill.unsafeFillOffheapBytes  9742.621 ±  2.270  ns/op
ArrayFill.unsafeFillOnHeapBytes  12640.019 ±  1.977  ns/op
ArrayFill.fillBytes(for reference) 561.540 ± 10.814 ns/op

The Unsafe variant do not enjoy the 'fill' pattern matching magic, nor do they get the SuperWord optimizations. What can you do? For this kind of thing you should use the Unsafe.setMemory method instead:
With the result:
ArrayFill.unsafeSetOffheapBytes   1259.281 ± 21.294  ns/op
ArrayFill.unsafeSetOnHeapBytes    1275.158 ± 27.950  ns/op
Not quite there, still ~2x worse (why? how come it doesn't just call the bytes fill method? a bit of digging shows it ends up calling the underlying platform's memset...) but beats being 20-25x worse like the handrolled method is.

Summary and Musings

It's the circle of life!
So what did we learn:
  • There's another kind of 'intrinsic' like optimization, which uses pattern matching to swap a block of code rather than a method. This is employed for memset like memory fill loops (in particular Arrays.fill) intrinsicfication. It's not an intrinsic technically, but you know what I fucking mean. 
  • System.arrayCopy/Arrays.fill implementations utilize SIMD instructions to improve their efficiency. These instructions are not available in plain Java, so some compiler intervention is required.
  • The JIT compiler is also able to use SuperWord Level Parallelism to derive SIMD code from 'normal' sequential code.
  • In the case of Arrays.fill, it looks like the SuperWord optimized code is faster than the fill specialized implementation for all types except bytes (on the system under test)
  • If you use Unsafe you will be excluded from these optimizations.
So I look at this process and I imagine history went something like this:
We want to use SIMD instructions, but the JIT compiler isn't really clever enough to generate them by itself. Memset implementations are rather specialized after all. Let's make life a bit easier for the compiler by creating an intrinsic. We'll even go the extra mile and make an effort to automatically identify opportunities to use this intrinsic, so now it's not really an intrinsic any more. The Arrays.fill optimization is available on Oracle JDK6u45 (the oldest I keep around, maybe it was there a while before that) and on that JVM it is twice as fast as the SLP generated code.
Over time, SLP gets better and eventually the compiler is now good enough to optimize the fill loop by itself and beat the specialized method. That is an awesome thing. We just need to remove the training wheels now.
And there's a final punch line to this story. Memset/Memcpy are such common and important opportunities for optimization, so Intel has decided to offer an assembly 'recipe' for them and save everyone the effort in writing them:
3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB)
Beginning with processors based on Intel microarchitecture code name Ivy Bridge, REP string operation using MOVSB and STOSB can provide both flexible and high-performance REP string operations for soft- ware in common situations like memory copy and set operations. Processors that provide enhanced MOVSB/STOSB operations are enumerated by the CPUID feature flag: CPUID:(EAX=7H, ECX=0H):EBX.ERMSB[bit 9] = 1. - [From the Intel Optimization Manual(September 2014)]
From the manual it seems that this method of implementing memcpy/memset can perform well, but like anything else, YMMV (the intel manual discussion of the performance differences is in itself interesting both on the results and the methodology level). One obvious advantage of this method is that it results in much much smaller code that should be trivial to inline into callers. This will however put the SuperWord method at a slight disadvantage, and the tide will change again.
[UPDATE 14/03/2015: It seems the good folks of Oracle have considered and rejected the use of REP MOVSB for array copy.]
Thanks go to the kind reviewers Peter 'Massive' Hughes, Darrach and the Shipster

Wednesday, 11 March 2015

Correcting YCSB's Coordinated Omission problem

YCSB is the Yahoo Cloud Serving Benchmark(also on wiki): a generic set of benchmarks setting out
The Nimbus Cloud Serving Board
to compare different key-value store providers under a set of loads:
The goal of the Yahoo Cloud Serving Benchmark (YCSB) project is to develop a framework and common set of workloads for evaluating the performance of different "key-value" and "cloud" serving stores.
The code is open for extension and contribution and all that good stuff, you can get it here. And it has become tool for comparing vendors in the NoSQL space. The benchmarks set out to measure latency and throughput. The terms are not directly defined in the paper, but the following statement is made:
The Performance tier of the benchmark focuses on the latency of requests when the database is under load. Latency is very important in serving systems, since there is usually an impatient human waiting for a web page to load. [...] Typically application designers must decide on an acceptable latency, and provision enough servers to achieve the desired throughput while preserving acceptable latency. [...] The YCSB Client allows the user to define the offered throughput as a command line parameter, and reports the resulting latency, making it straightforward to produce latency versus throughput curves.
What could possibly go wrong?™
It can go this wrong for instance, order of magnitude difference in results for different percentiles, leading to some poor decision making on how much hardware you'll need, leading to getting fired from your job and growing old bitter and twisted mumbling to yourself as you get drunk on the street corner until you freeze to death on a winter night. So potentially this is a risk to your future well being, listen up!


It's broken? Coordinated WOT?

When you measure latency bad,
Mr. Tene is sad
 :(
My colleague at Azul, Gil Tene, the magnificent, glorious, multi-dimensional, coding CTO, officer and gentleman (that's my bonus sorted) has been doing a lot of latency related preaching and teaching in the last few years. He has given the following talks at any number of conferences, but if you happened to have missed them, watch them NOW:
In particular he has coined the term "Coordinated Omission" (see raging discussion on Mechanical Sympathy) to describe instances in which the measuring/monitoring system coordinates measurement with the system under test/measurement such that samples are biased. This issue manifests in many load generating frameworks where the call into the system under test is done synchronously and the measurement thread holds up the next sample while the call is ongoing. This enables the system under test to delay requests that would have been made during the synchronous call thus skewing the sample set. Consider for example a system where:
  • We set out to measure a request every 1ms (from a single thread, synchronously)
  • The first 500 calls come back in 100µs each (so call K starts at Kms and returns at Kms + 100µs )
  • Call 501 takes 500 milliseconds (starts at 500ms, returns at 1 second)
  • Call 502 takes 100µs
See the problem?
The problem is that call 502 did NOT happen at it's designated time, and saying it took 100µs fails to capture this. It failed the assumptions laid out in the first sentence because we were blocked for 500ms. If we were to stick to our original schedule we would be making calls 502 to 1000 in the time it took for call 501 to execute. How should we treat this departure from plan?
Is it safe to come out yet?
  1. Ignore it! it will go away by itself! - This is the coordinated omission way. We are now reporting numbers that are no longer according to the test plan, which means that our "latency versus throughput curves" are off the mark. This is a very common solution to this issue.
  2. Fail the test, we wanted a call every ms and we didn't get that - This is an honest hardline answer, but it potentially throws the baby with the bath water. I think that if you set out to schedule 1000 calls per second you might want to see how often this falls apart and how. But this answer is The Truth™, can you handle it? If one is to start from scratch and write their own load generator I propose a read of the Iago load test framework philosophy page: "Iago accurately replicates production traffic. It models open systems, systems which receive requests independent of their ability to service them. Typical load generators measure the time it takes for M threads to make N requests, waiting for a response to each request before sending the next; if your system slows down under load, these load testers thus mercifully slow down their pace to match. That's a fine thing to measure; many systems behave this way. But maybe your service isn't such a system; maybe it's exposed on the internet. Maybe you want to know how your system behaves when N requests per second come in with no "mercy" if it slows down.". This is a fine sentiment.
  3. Coordinated Omission correction: Adjust the results to reflect the expected call rate. This can be done in a straight forward manner if the 'missing' calls are added back with a latency which reflects the period for which they were delayed. This correction method is supported out of the box by HdrHistogram but the discussion regarding it's over or under estimation of the impact of the delay is outside the scope of this post.
  4. Coordinated Omission avoidance: Measure all calls according to original schedule. We are now saying: "If I can't make the call, the meter is still running!". This is particularly relevant for systems where you would typically be making the requests to the system under test from a thread pool. That thread pool would be there to help you support asynchronous interaction where the API failed in giving you that option. Like JDBC... Like many key-value pair provider APIs.
This last solution is the one we'll go for in this post, but I would urge you to consider the results critically. In particular if you are trying to simulate independent access to a web server (as opposed to a DB via a thread pool) then the adherence to schedule might be hugely optimistic of the results in your case. This is because failing to generate independent load may have all sorts of beneficial effects on the system under test.
For the YCSB benchmark I'm assuming the harness/load generator is simulating a web serving layer accessing the key-value store in an effort to serve an unbounded, uncoordinated user request load via a predefined thread pool. So it's door number 3 for me. The corrected load generator is here.


Step 0: Some preliminary work (not strictly required)

HdrHistogram, as approved by Freddie!
As described previously here, we should all just get on with capturing latency using HdrHistograms. So as a first step toward correcting YCSB I have gone in and added an HdrHistogram measurement container. This is pretty straight forward as all I needed to modify was the Measurements class to allow a new measurement type. While I was there I tweaked this and that and the following list of changes to that class emerged:
  1. Add new measurement type and corresponding command-line option("-p measurementtype=hdrhistogram")
  2. Add combined measurement option allowing old/new measurement side by side: "hdrhistogram+histogram"
  3. Add support for capturing both corrected and uncorrected measurements for the same run.
  4. Use CHM instead of synchronizing around a HashMap.
The new measurement type supports logging loss less HdrHistogram data to a file (controlled by the hdrhistogram.fileoutput=<true|false> option and the hdrhistogram.output.path=<path> option) as well as better precision percentile data and lock free logging of latencies. This is not very interesting work but if you are interested in the "How would I plug in my own data structure to capture latency into YCSB?" topic have fun. It was not necessary for correction but it was good to do so that better quality results can be observed. You're welcome.


Step 1: Demonstrate the issue

YCSB includes a very useful means of verifying the measurements in the form of a mock DB driver. This means we can test our assertions regarding coordinated omission without setting up a key value store of any kind. The mock DB is called BasicDB and is the default DB used. We can configure it to simulate a pause and see what happens (-p basicdb.verbose=false -p basicdb.simulatedelay=4  will make the mock DB stop logging every action and simulate a latency of 0-4ms for each action). I added a further option to the BasicDB which allows us to turn off the randomization of the delay (-p basicdb.randomizedelay=false).
Let's consider our expectations in the case where a DB simply cannot handle request quickly enough. We can setup an experiment with the following settings: -target 1000 -threads 1 -s -p status.interval=1 -p workload=com.yahoo.ycsb.workloads.CoreWorkload -p basicdb.verbose=false -p basicdb.simulatedelay=4 -p basicdb.randomizedelay=false -p measurementtype=hdrhistogram -p maxexecutiontime=60
Here's what they all mean:
  • -target 1000 -> We aim to test 1000 requests per second
  • -threads 1 -> We have a single client thread
  • -s -p status.interval=1 -> We will be printing out status every second (I made the status interval configurable)
  • -p basicdb.verbose=false -p basicdb.simulatedelay=4 -p basicdb.randomizedelay=false -> The DB will sleep 4ms on each request, so the maximum we can hope for is 250, no noisy printing per operation please
  • -p measurementtype=hdrhistogram -> Use HdrHistogram to capture the latencies
  • -p maxexecutiontime=60 -> Run for one minute, then exit and print summary
This DB is obviously failing, it can't keep up with the rate of incoming requests and according to our model they queue up. The time measured per call is reflected in the summary for the READ operations:
[READ], Operations, 12528.0
[READ], AverageLatency(us), 4477.102809706258
[READ], MinLatency(us), 4018.0
[READ], MaxLatency(us), 44703.0
[READ], 95thPercentileLatency(ms), 4.0
[READ], 99thPercentileLatency(ms), 4.0
[READ], Return=0, 12528
But this completely ignores the time spent on the queue. If we were measuring according to schedule we'd get the following set of latencies:
Latency[k] = 4 + 3*(k-1) ms
The max latency would be for the last request to get in. We ran for 60 seconds, at 250 requests/sec which means our last request was (k=15000) and had a latency of 45 seconds when measured from the time we intended to make it. This number reflects the system's failure to handle load far more correctly than the numbers quoted above.


Step 2: Working to Schedule

The YCSB load generator has a weak notion of schedule, in the sense that it opts for option number 1 above and will just execute the operations when it can. When faced with the task of correcting this kind of issue (in a pile of foreign code) we need to look for 2 things in the load generator:
  1. "Scheduling an action to run at time X" - This will involve some calls to one of the many scheduling facilities in the JDK:
    1. Thread.sleep is an old favourite, but TimeUnit also supports a sleep method. A search for sleep in the code base will cover both. This is what YCSB was using to schedule next event to fire.
    2. Code submitting tasks to java.util.Timer, or alternatively the ScheduledExecutorService
    3. Code using LockSupport.parkNanos
    4. Object.wait(...)
    5. others?
  2. "Measuring the length of an operation" - This will involve calls to System.nanoTime() or currentTimeMillis(). For YCSB this is found to happen for example here.

To correct this problem I had to introduce the concept of 'intended start time' to the operations measurement. Schedule for YCSB is specified by the -target command line option which sets the overall number of operations per second to be attempted by the load generator. This is optional, and the default is to go as fast as you can manage, i.e. with no schedule but the back pressure from the system under test to guide us. I'm not sure what a good assumed rate of requests is reasonable in this case, so I did not correct this case. NOTE: If you don't specify target no correction will take place.
The target parameter is translated to a per-thread operation rate (number of threads is set via the threads option, default is 1) so if we have 10 threads, and the target request rate is 1000 (-target 1000 -threads 10) we will have each thread hitting the store with 100 requests per second. The client threads randomize the first operation time to avoid all hitting the store on the same interval. I did some ground work here by setting the units across the board to nanoseconds and naming interval parameters appropriately, nothing too exciting.
The actual correction at it's core involves:
  1. Record the operation's intended start time
  2. Use the intended start time when computing latency
Sadly the way YCSB measures latency does not lend itself to a simple in place fix. The operations are scheduled by the ClientThread which calls into a workload, calling into a DB, which is actually the DBWrapper which measures the latency (for calling into an actual DB implementation) and reports it to the central Measurements singleton. This means that changing the Workload/DB API to include a startTime parameter to each call is quite a far reaching change which would require me to dig through all the DB drivers implementations and would result in a very unpleasant time for all.
I settled on using a thread local on the Measurements object to transfer the start time to the DBWrapper, it is not a nice way to do things (and I'm happy to hear better suggestions) but it does the job without modifying the API.
Once we have:
  1. ClientThread setting up the start time for the operation via Measurements
  2. DBWrapper using the start time from Measurements to compute the operation latency
That's pretty much it. For extra points I wanted to include some facilities to compare measurements before/after the change. These can be removed if we accept HdrHistogram as a replacement and if we accept we only want to measure the intended latency, which would result in a much smaller PR.


Step 3: is the issue solved?

Running the setup from step 1 such that it produces the intended latency as well as the original measurement side by side(-p measurement.interval=both) yields the following result for the READ operations:
[READ], Operations, 12414.0
[READ], AverageLatency(us), 4524.981069759949
[READ], MinLatency(us), 4018.0
[READ], MaxLatency(us), 24703.0
[READ], 95thPercentileLatency(ms), 4.0
[READ], 99thPercentileLatency(ms), 4.0
[READ], Return=0, 12414
[Intended-READ], Operations, 12414.0
[Intended-READ], AverageLatency(us), 2.359010991606251E7
[Intended-READ], MinLatency(us), 4256.0
[Intended-READ], MaxLatency(us), 4.6989311E7
[Intended-READ], 95thPercentileLatency(ms), 42369.0
[Intended-READ], 99thPercentileLatency(ms), 46530.0

This reflects the effect a backed up system would have on latency as we express in Step 1 above. It's actually a bit worse because the average cost of calling the mock DB with a sleep of 4ms is 4.5ms. As we can see the maximum latency is 46.9 seconds, reflecting the fact that the last read to execute was scheduled to hit the system 13.1 seconds into the run.


Step 4: The limitations of the harness

We can now also consider the perfect DB for the sake of observing the short comings of the test harness by setting the mock DB delay to 0(-p basicdb.simulatedelay=0):
[READ], Operations, 56935.0
[READ], AverageLatency(us), 0.01796785808377975
[READ], MinLatency(us), 0.0
[READ], MaxLatency(us), 49.0
[READ], 95thPercentileLatency(ms), 0.0
[READ], 99thPercentileLatency(ms), 0.0
[READ], Return=0, 56935
[Intended-READ], Operations, 56935.0
[Intended-READ], AverageLatency(us), 232.37026433652412
[Intended-READ], MinLatency(us), 0.0
[Intended-READ], MaxLatency(us), 39007.0
[Intended-READ], 95thPercentileLatency(ms), 0.0
[Intended-READ], 99thPercentileLatency(ms), 0.0

How come it take so long to measure a noop? why such large differences? Here's some generic theories and how they panned out:
  • The JVM running the load generator is running with suboptimal settings(-Xms64m -Xmx64m -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime, Oracle JDK8u31) on a busy Mac laptop running on battery
This is no way to benchmark anything, but the interesting thing is that if we have no schedule to stick to the test harness is willing to just ignore the issue. If we run on a decent machine (with a decent OS) we get nicer results. This is from a server class machine running CentOS6.3/OracleJDK8u25 with same settings:
[READ], Operations, 56930.0
[READ], AverageLatency(us), 0.44417705954681186
[READ], MinLatency(us), 0.0
[READ], MaxLatency(us), 20.0
[READ], 95thPercentileLatency(ms), 0.0
[READ], 99thPercentileLatency(ms), 0.0
[READ], Return=0, 56930
[Intended-READ], Operations, 56930.0
[Intended-READ], AverageLatency(us), 146.31262954505533
[Intended-READ], MinLatency(us), 15.0
[Intended-READ], MaxLatency(us), 14255.0
[Intended-READ], 95thPercentileLatency(ms), 0.0
[Intended-READ], 99thPercentileLatency(ms), 0.0
This is still significant.

  • The JVM suffers from warmup related artefacts
This certainly correlated to the max values I'm seeing here. When looking at the status line for the first second I see:
[READ: Count=22, Max=14, Min=0, Avg=0.64, 90=0, 99=14, 99.9=14, 99.99=14]
[Intended-READ: Count=23, Max=14255, Min=15, Avg=5563.39, 90=13287, 99=14255, 99.9=14255, 99.99=14255]
 But after a few seconds the process settles and we see much better results, this is typical:
[READ: Count=947, Max=14, Min=0, Avg=0.02, 90=0, 99=0, 99.9=2, 99.99=14]
[Intended-READ: Count=946, Max=194, Min=61, Avg=151.29, 90=165, 99=175, 99.9=186, 99.99=194]
A good way to handle this issue is by relying on the HdrHistogram output files to grab relevant time periods for analysis. With the original YCSB output we have the percentile summary data, but this is not something we can combine for analysis. With the loss-less interval histogram logs we can look at any sub-period(which is longer than one interval, but shorter than the whole run) and get accurate full range histogram data. A common practice is to discard warmup period results, I'm no a fan of throwing away data, but since this is the load generator warmup I'd think it's quite legitimate. It's perhaps an interesting feature to add to such a framework that the framework can be warmed up separately from the system to examine cold system behaviour.
  • Thread.sleep/LockSupport.parkNanos are not super accurate and may wakeup after the intended operation start time
I've added an option for spinning instead of sleeping (so burn a CPU). This has improved the average value dramatically from ~146µs to ~3.1µs. A typical status line now looks like:
[READ: Count=947, Max=13, Min=0, Avg=0.02, 90=0, 99=0, 99.9=3, 99.99=13]
[Intended-READ: Count=948, Max=57, Min=0, Avg=0.47, 90=1, 99=12, 99.9=26, 99.99=57]
It is obviously not desirable for the load generator to burn a CPU instead of using sleep, but it also introduces scheduling inaccuracies. This is an accuracy issue we didn't have to deal with when not measuring from a schedule. This didn't impact the measured outliers, but has dramatically reduced their number. The take away here is just that there are accuracy limitations to the load generators ability to stick to schedule.

  • GC pauses that are large enough to derail the schedule on the load generator side are now captured. Unless the GC pauses happen inside the measurement gap we will have no idea we have gone off schedule if we don't track the intended start time.
We should capture GC logs on load generator side and make sure we correlate the GC events with recorded latencies. Here's a GC pause being captured by the corrected measurement:
[READ: Count=952, Max=0, Min=0, Avg=0, 90=0, 99=0, 99.9=0, 99.99=0] 
[Intended-READ: Count=952, Max=14, Min=0, Avg=0.03, 90=0, 99=0, 99.9=3, 99.99=14]
[GC (Allocation Failure) [PSYoungGen: 17895K->1824K(18944K)] 17903K->1840K(62976K), 0.0024340 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
Total time for which application threads were stopped: 0.0026392 seconds
[READ: Count=957, Max=0, Min=0, Avg=0, 90=0, 99=0, 99.9=0, 99.99=0]
[Intended-READ: Count=957, Max=2719, Min=0, Avg=5.21, 90=0, 99=0, 99.9=2119, 99.99=2719]
This process is running with a 64M heap, you can expect longer pauses as the heap grows (in particular as the young generation grows).

  • The operation setup time is now being measured as well as the operation itself.
When running with the spin option we can see the average operation cost is ~3.1µs, this is all test harness overhead and is really quite negligible in the context of network hopping operations. In other words, nothing to worry about for this harness but could well prove an issue for others.


Step 5: The Good, The Bad And The STW pausing DB

Many software processes have a latency profile that is far from normally distributed. To see what YCSB makes of this kind of profile now that we can compare corrected vs. uncorrected measurement I have built a mock DB that has 4 mods of latency (p is uniform random number [0,1]):
  1. Awesome (p < 0.9): we return in 200µs-1ms
  2. Good (0.9 < p < 0.99): we return in 1-10ms
  3. Minor Hiccup( 0.99 < p < 0.9999): we hit a bump, but only one thread is affected 10-50ms
  4. Major Hiccup(0.9999 < p): we hit a STW pause(because GC/THP/LBJ/STD/others), all threads stop for 50-200ms
I implemented the above with a read write lock, where the STW pause grabs the write lock and all the others grab the read lock. This is far from a perfect representation of a system (everyone waits for STW as intended, but also STW waits for everyone to start), but it will do. If you feel strongly that a better simulation is in order, write one and let's discuss in the comments!
What sort of profile will we see? How far off course will our measurements be if we don't stick to schedule? Here's this setup run at a rate of 10,000 requests per second, with 25 threads (so each thread is trying for 250 reqs/sec or 1 request per 4ms):
[READ], Operations, 569516.0
[READ], AverageLatency(us), 1652.1852871561116
[READ], MinLatency(us), 210.0
[READ], MaxLatency(us), 142463.0
[READ], 95thPercentileLatency(ms), 1.0
[READ], 99thPercentileLatency(ms), 19.0
[READ], Return=0, 569516

According to these numbers, the max is quite high but the overall impact of hiccups is not too severe (all depends on your requirements of course). Even at this stage we can see that the effect of global pauses is skewing the other measurements (if you hit a short operation while a STW pause is in progress you still have to wait for the STW event to finish).
The corrected measurements tell a different story:
[Intended-READ], Operations, 569516.0
[Intended-READ], AverageLatency(us), 24571.6025835973
[Intended-READ], MinLatency(us), 268.0
[Intended-READ], MaxLatency(us), 459519.0
[Intended-READ], 95thPercentileLatency(ms), 83.0
[Intended-READ], 99thPercentileLatency(ms), 210.0


How can this be right? Can this be right?
  • At a rate of 10000 request per second, the unlikely Major Hiccup is likely to happen every second. Consider this next time someone tells you of a 99.99%ile behaviour. Given an event rate of 10K per second, 99.99% is suddenly not very rare. Consider that at this rate there's likely to be a few events that are worse.
  • The average major hiccup is 125ms long, in this time 125/4 events are delayed on all 25 threads -> 125 * 25 / 4 = 781 events are delayed from starting, they will further delay each other as they execute. In roughly 12 seconds we can see how it is quite probable that one of these events is another major hiccup. What with all the queuing up behind the first one etc, the pileup becomes quite reasonable.
  • The probability of a 'mode' is not the probability of the per event latency once STW and queuing effects are in play.
I've made the mock DB print out 'OUCH' every time we get slapped with a STW event. It turns out that we got very unlucky in this run and hit three of these in a row:
56 sec:
[READ: Count=9192, Max=83903, Min=238, Avg=1745.13, 90=1531, 99=26959, 99.9=79551, 99.99=83775]
[Intended-READ: Count=9208, Max=159999, Min=303, Avg=16496.92, 90=54271, 99=103807, 99.9=150527, 99.99=158335]
OUCH
OUCH
OUCH
57 sec: 
[READ: Count=9642, Max=129727, Min=247, Avg=2318, 90=1799, 99=40607, 99.9=125631, 99.99=127359]  
[Intended-READ: Count=9635, Max=459519, Min=320, Avg=102971.39, 90=200319, 99=374271, 99.9=442367, 99.99=457983]

This is quite telling.
The view on what's the worst second in this run is wildly different here. Because the uncorrected measurement takes each event as it comes it will take the view that 75 events were delayed by these hiccups, and none by more than 130ms. But from the corrected measurement point of view all the queued up measurements were effected and were further delayed by each other.
I've re-run, this time logging interval histograms in their compressed form for every second in the run. Logging a 60 seconds run with 1 second interval data cost me 200k (we can tweak the construction in OneMeasurementHdrHistogram to minimize the cost). I can take the compressed logs and use the HistogramLogProcessor script provided with HdrHistogram to process the logs (you need to copy the HdrHistogram.jar into the script folder first). Running:
./HistogramLogProcessor -i READ.hdr -o uncorrected -outputValueUnitRatio 1000
./HistogramLogProcessor -i Intended-READ.hdr -o corrected -outputValueUnitRatio 1000
Will produce *.hgrm files for both. I then use the plotFiles.html to generate the following comparison:


They tell very different stories don't they.
The red line will have you thinking your system copes gracefully up to the 99%ile slowly degrading to 20ms, when measuring correctly however the system is shown to degrade very quickly with the 20ms line crossed as early as the median, and the 99%ile being 10 times the original measurement. The difference is even more pronounced when we look at one of those terrible seconds where we had back to back STW hiccups. I can use the HistogramLogProcessor script to produce partial summary histograms for the 3 seconds around that spike:
./HistogramLogProcessor -i Intended-READ.hdr -o correctedOuch3 -outputValueUnitRatio 1000 -start 1425637666.488 -end 1425637668.492
Similarly we can compare a good second with no STW pauses:


Summary

Coordinated Omission is a common problem in load generators (and other latency reporters), we had a look at fixing YCSB, an industry standard load generator:
  • Replaced the data structure used to capture latency with HdrHistogram: that is just generally useful and gives us better data to work with when examining the corrected measurement
  • Found scheduling code and introduced notion of operation start time.
  • Found measuring code and captured both operation cost (uncorrected measurement) and scheduled time latency (corrected measurement).
  • Use a mock system under test to evaluate measurement of known scenario. This is a very handy thing to have and luckily YCSB had this facility in place. In other places you may have to implement this yourself but it's a valuable tool to have in order to better understand the measurement capabilities of your harness. This helped highlight the scale of scheduling inaccuracies and test harness overhead per operation, as well as the scale of test harness error during its own warmup period.
  • Use HdrHistogram facilities to visualise and analyse latency histogram data from the compressed histogram logs.
Thanks goes to this posts kind reviewers: Peter Huges, Darach, and Philip Aston