Monday, 1 December 2014

The Escape of ArrayList.iterator()

{This post assumes some familiarity with JMH. For more JMH related content start at the new and improved JMH Resources Page and branch out from there!}
Escape Analysis was a much celebrated optimisation added to the JVM in Java 6u23:
"Based on escape analysis, an object's escape state might be one of the following:
  • GlobalEscape – An object escapes the method and thread. For example, an object stored in a static field, or, stored in a field of an escaped object, or, returned as the result of the current method.
  • ArgEscape – An object passed as an argument or referenced by an argument but does not globally escape during a call. This state is determined by analyzing the bytecode of called method.
  • NoEscape – A scalar replaceable object, meaning its allocation could be removed from generated code.
After escape analysis, the server compiler eliminates scalar replaceable object allocations and associated locks from generated code. The server compiler also eliminates locks for all non-globally escaping objects. It does not replace a heap allocation with a stack allocation for non-globally escaping objects." - from Java 7 performance enhancements 
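To make these states a little more concrete, here is a small illustrative sketch of my own (the class and method names are made up, and whether the JIT actually proves each state depends on inlining and compilation heuristics):
public class EscapeStatesExample {
    static Object sink;

    Object globalEscape() {
        Object o = new Object();
        sink = o;        // stored in a static field -> GlobalEscape
        return o;        // also returned from the method -> GlobalEscape
    }

    int argEscape() {
        // passed as an argument, but the callee does not let it escape -> ArgEscape
        return lastIndex(new StringBuilder("tmp"));
    }

    private int lastIndex(CharSequence cs) {
        return cs.length() - 1; // does not store cs anywhere
    }

    long noEscape() {
        long[] tmp = { 42 };    // never leaves the method -> NoEscape,
        return tmp[0];          // scalar replaceable, so the allocation can be eliminated
    }
}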
Alas, proving an object never escapes is a difficult problem, and many people feel they cannot rely on this optimisation to kick in and "do the right thing" for them. Part of the problem is that there is no easy way to discover if a particular allocation has been eliminated (on a debug OpenJDK build one can use -XX:+UnlockDiagnosticVMOptions -XX:+PrintEscapeAnalysis -XX:+PrintEliminateAllocations).
The EscapeAnalysis skepticism leads some people to go as far as claiming that the JIT compiler fails to eliminate the iterator allocation of collections, which are everywhere:
long sum = 0;
for (Iterator<Long> iter = list.iterator(); iter.hasNext();) {
    sum += iter.next();
}

But why skepticize what you can analyze?

Theory: even simple iterators do not benefit from Escape Analysis

So let's give this some thought. I generally think the best of the JVM developer bunch. Sure they may miss here and there, they're just human after all, but we are talking about a good strong bunch of engineers. I tend to assume that if there is a simple case for an optimization that is worthwhile, then they have not failed to capitalize on it. I would therefore be quite surprised if indeed iterators do not benefit from escape analysis, as they seem quite natural candidates, particularly in the syntactic sugar case, but even in the explicit case. Still, my trust in the JVM engineers is no evidence; how can I prove this works for a given setup?
  1. Use a debug build... I invite the readers to try this method out, didna do it.
  2. Use a memory profiler, like the ones bundled with VisualVM or YourKit
  3. Set up an experiment to observe the before/after effect of the desired optimization. Use -XX:+/-DoEscapeAnalysis and examine GC logs to observe the effect.
  4. Look at the assembly...

Experiment: Observe a loop over an array list

Reach for your tool belts and pick the Swiss army knife that is JMH. Here's the benchmark I will use for this:
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class IteratorEscapeAnalysis {
    @Param({ "1000" })
    public int size;

    private List<Long> list;

    @Setup(Level.Trial)
    public void buildData() throws Exception {
        list = new ArrayList<Long>();
        for (long ii = 512; ii < size + 512; ii++) {
            list.add(ii);
        }
    }

    @Benchmark
    public long sumIteratorOverList() {
        long sum = 0;
        for (Iterator<Long> iter = list.iterator(); iter.hasNext();) {
            sum += iter.next();
        }
        return sum;
    }
}
This is one of those times when I'm not particularly concerned with the performance of this bit of code as such, but JMH makes a good crucible for Java code. The same effort that went into correct measurement enables the examination of code in an isolated manner. This makes JMH a good tool for testing the JIT compiler.

Measurement: Profile the experiment

I want to plug the experiment into a profiler, so I set the number of iterations to 1000 and get into it man, profiling, doing it you know, like a... like a sex machine, can I count it all? Here's YourKit reporting:

Ouch! That array list iterator is right there at the top! All that trust I put in the JVM developers? GONE! Let's get a second opinion from JVisualVM, just to be sure:

Oh me, oh my, this is bad...
Finally, with tears in our eyes, let us see what Java Mission Control has to say:

How curious! The iterator allocation is gone! But JMC is outnumbered 2 to 1 here. Could it be that some profilers are pooping their pants?

Measurement: Measure with -XX:+/-DoEscapeAnalysis

Sometimes we get lucky and the optimization we want to examine comes with a handy flag to turn it on and off. We expect escape analysis to remove the iterator allocation and thus leave us with a benchmark which generates no garbage. We are also fortunate that JMH comes with the handy GC profiler which simply examines the GC JMX bean to inform us if any collections happened. Running the benchmark with a short list size and a small heap should trigger plenty of young generation GC cycles in each measurement iteration. Lo and behold, with escape analysis on:
$java -jar target/microbenchmarks.jar -f 1 -i 10 -wi 10 -p size=1000 \
 -jvmArgs="-Xms64m -Xmx64m -XX:+DoEscapeAnalysis" -prof gc ".*.sumIteratorOverList"
Benchmark                                           (size)    Score    Error   Units
sumIteratorOverList                                   1000  816.367 ± 17.768   ns/op
sumIteratorOverList:@gc.count.profiled                1000    0.000 ±    NaN  counts
sumIteratorOverList:@gc.count.total                   1000    0.000 ±    NaN  counts
sumIteratorOverList:@gc.time.profiled                 1000    0.000 ±    NaN      ms
sumIteratorOverList:@gc.time.total                    1000    0.000 ±    NaN      ms


$java -jar target/microbenchmarks.jar -f 1 -i 10 -wi 10 -p size=1000 \
 -jvmArgs="-Xms64m -Xmx64m -XX:-DoEscapeAnalysis" -prof gc ".*.sumIteratorOverList"
Benchmark                                           (size)    Score    Error   Units
sumIteratorOverList                                   1000  940.567 ± 94.156   ns/op
sumIteratorOverList:@gc.count.profiled                1000   19.000 ±    NaN  counts
sumIteratorOverList:@gc.count.total                   1000   42.000 ±    NaN  counts
sumIteratorOverList:@gc.time.profiled                 1000    9.000 ±    NaN      ms
sumIteratorOverList:@gc.time.total                    1000   27.000 ±    NaN      ms
Now that is what we hoped for: escape analysis saves the day! Please note the above numbers are from running on my laptop using Oracle Java 8u20; I was not aiming for measurement accuracy, just to verify the compilation/GC aspects. Laptops are good enough for that.

WTF Profilers? Why you lie?

There's a big difference in how JVisualVM/YourKit work and how JMC works, in particular:
  • JVisualVM/YourKit: treat the JVM as a black box and profile via the JVMTI interfaces and bytecode injection. Work on all JVMs.
  • Java Mission Control: use internal JVM counters/reporting APIs only available to the Oracle JVM (so can't profile OpenJDK/Zing/IBM-J9 etc)
So why should this get in the way of escape analysis? Searching through the OpenJDK source code we can spot the following:
void C2Compiler::compile_method(ciEnv* env, ciMethod* target, int entry_bci) {
  assert(is_initialized(), "Compiler thread must be initialized");

  bool subsume_loads = SubsumeLoads;
  bool do_escape_analysis = DoEscapeAnalysis && !env->jvmti_can_access_local_variables();
This explains a similar issue, but not this one. The above code is relevant for debuggers and other tools relying on stepping through code and examining variable values.
This issue is the result of some instrumentation trapping the reference, at least for JVisualVM. We can look at the assembly to be certain, this is the iterator non-allocation before we connect JVisualVM to the process:
shl rbp,0x3 ;*invokeinterface iterator
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverList@6 (line 39)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverListNotInlined@1 (line 48)
mov r13d,DWORD PTR [rbp+0xc] ;*synchronization entry
; - java.util.ArrayList$Itr::<init>@-1 (line 820)
; - java.util.ArrayList::iterator@6 (line 814)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverList@6 (line 39)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverListNotInlined@1 (line 48)
mov r9d,DWORD PTR [rbp+0x10] ;*getfield size
; - java.util.ArrayList::access$100@1 (line 102)
; - java.util.ArrayList$Itr::hasNext@8 (line 826)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverList@13 (line 39)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverListNotInlined@1 (line 48)

Note how the iterator is initialized with no allocation actually taking place (note also how the '*synchronization entry' annotation on the 'mov r13d,DWORD PTR [rbp+0xc]' line is a big fat lie: this is actually loading modCount from the list). The method is recompiled to the same code when we attach the agent. And here's what happens when we turn on memory profiling:
shl rbp,0x3 ;*invokeinterface iterator
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverList@6 (line 39)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverListNotInlined@1 (line 48)
mov rax,QWORD PTR [r15+0x60]
mov r10,rax
add r10,0x20
cmp r10,QWORD PTR [r15+0x70]
jae 0x000000010afc5f7b
mov QWORD PTR [r15+0x60],r10
prefetchnta BYTE PTR [r10+0xc0]
mov r10d,0xfee10af6 ; {oop('java/util/ArrayList$Itr')}
mov r10,QWORD PTR [r12+r10*8+0xb0]
mov QWORD PTR [rax],r10
mov DWORD PTR [rax+0x8],0xfee10af6
; {oop('java/util/ArrayList$Itr')}
mov DWORD PTR [rax+0xc],r12d
mov DWORD PTR [rax+0x1c],r12d
;*new ; - java.util.ArrayList::iterator@0 (line 814)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverList@6 (line 39)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverListNotInlined@1 (line 48)
mov DWORD PTR [rax+0x10],0xffffffff
;*putfield lastRet
; - java.util.ArrayList$Itr::<init>@11 (line 822)
; - java.util.ArrayList$Itr::<init>@2 (line 820)
; - java.util.ArrayList::iterator@6 (line 814)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverList@6 (line 39)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverListNotInlined@1 (line 48)
mov r11d,DWORD PTR [rbp+0xc] ;*getfield modCount
; - java.util.ArrayList$Itr::<init>@19 (line 823)
; - java.util.ArrayList$Itr::<init>@2 (line 820)
; - java.util.ArrayList::iterator@6 (line 814)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverList@6 (line 39)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverListNotInlined@1 (line 48)
mov DWORD PTR [rax+0x14],r11d
;*putfield expectedModCount
; - java.util.ArrayList$Itr::<init>@22 (line 823)
; - java.util.ArrayList$Itr::<init>@2 (line 820)
; - java.util.ArrayList::iterator@6 (line 814)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverList@6 (line 39)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverListNotInlined@1 (line 48)
mov r10,rbp
shr r10,0x3
mov DWORD PTR [rax+0x18],r10d
mov rbp,rax ;*synchronization entry
; - java.util.ArrayList$Itr::<init>@-1 (line 820)
; - java.util.ArrayList::iterator@6 (line 814)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverList@6 (line 39)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverListNotInlined@1 (line 48)
mov edx,0x364
mov rsi,rbp
call 0x000000010af9af60 ; OopMap{rbp=Oop off=176}
;*invokestatic traceObjAlloc
; - java.util.ArrayList::iterator@13 (line 814)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverList@6 (line 39)
; - psy.lob.saw.IteratorEscapeAnalysis::sumIteratorOverListNotInlined@1 (line 48)
; {static_call}
The iterator is now allocated (see the '*new' annotation in the listing above). We can see that the instrumenting agent added a traceObjAlloc call into the iterator construction code (the '*invokestatic traceObjAlloc' annotation near the end). JVisualVM is open source, so we can dig into the code of the above trace method and the instrumentation code; I leave it as an exercise to the reader. If I were a better man I might have a go at fixing it, maybe another day.
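For the curious, here is a rough Java-level rendering of what ArrayList.iterator() looks like after the profiler's bytecode injection, based on the assembly above (the ProfilerRuntime name is my own placeholder, not JVisualVM's actual class):
public Iterator<E> iterator() {
    Iterator<E> it = new Itr();
    // injected call: the freshly allocated iterator is handed to the profiler,
    // so it escapes and the allocation can no longer be eliminated
    ProfilerRuntime.traceObjAlloc(it, 0x364); // 0x364 is the class id constant seen in the assembly
    return it;
}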

Summary

  • EscapeAnalysis works, at least for some trivial cases. It is not as powerful as we'd like it, and code that is not hot enough will not enjoy it, but for hot code it will happen. I'd be happier if the flags for tracking when it happens were not debug only.
  • Profilers can kill EscapeAnalysis. In the example above we have very little code to look at, so it is easy to cross-examine. In the real world you'd be profiling a much larger codebase, looking for allocation elimination opportunities; I suggest you maintain a healthy doubt in your profiler.
  • Java Mission Control is my current favourite profiler for the Oracle JVM. It's a shame the underlying APIs are not made public and standard so that tool makers can rely on them for all JVMs. Perhaps one day at least part of these APIs will be made public.
Thanks Peter Lawrey for reviewing this post!

Monday, 3 November 2014

The Mythical Modulo Mask

It is an optimisation well known to those who know it well that % by a power of 2 can be replaced by a much cheaper AND with that power of 2 minus 1. E.g.:
x % 8 == x & (8 - 1)
[4/11/2014 NOTE] This works because the binary representation for N which is a power of 2 will have a single bit set to 1 and (N-1) will have all the bits below that set to 1 (e.g 8 = 00001000, 8-1= 00000111). When we do x AND (N-1) only the remainder of x / N will match the N-1 mask.
[4/11/2014 NOTE + slight spoiler: this only works when x >= 0]
[15/02/2016 NOTE: It has been pointed out that % is not modulo, but remainder. Why everyone calls it modulo I'm not sure]
The reason the & is so much cheaper is that % is implemented using the DIV instruction while & is just AND, and as it turns out DIV is expensive and AND is cheap on x86 CPUs (and other places too I think). The optimisation is used in the Disruptor as well as the JCTools circular array queues and in ArrayDeque and other JDK classes. Is it time to replace % with & everywhere in your code which has this opportunity?
[4/11/2014 NOTE]  Technical term for this sort of optimization is Strength Reduction
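For context, this is roughly how the circular array structures mentioned above turn an ever-growing sequence number into an array index (a sketch with made-up field names, not the actual Disruptor/JCTools code):
final int capacity = 1024;                    // must be a power of 2 for the mask trick to hold
final int mask = capacity - 1;
final Object[] buffer = new Object[capacity];
long producerIndex;                           // only ever incremented, so never negative

void put(Object e) {
    buffer[(int) (producerIndex & mask)] = e; // cheap AND instead of producerIndex % capacity
    producerIndex++;
}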

Starting Simple

Let's start with some basic benchmarks:
private static final int LENGTH = 16;
int[] ints = new int[LENGTH];
int mask = LENGTH - 1;
int someIndex = 5;

@Benchmark
public int moduloLengthNoMask() {
    return someIndex % ints.length;
}

@Benchmark
public int moduloLengthMask() {
    return someIndex & (ints.length - 1);
}

@Benchmark
public int moduloConstantLengthNoMask() {
    return someIndex % LENGTH;
}

@Benchmark
public int moduloMask() {
    return someIndex & mask;
}

@Benchmark
public int consume() {
    return someIndex;
}

@Benchmark
public void noop() {
}
And the results (on JDK8u5/E5-2697 v2 @ 2.70GHz/-XX:-UseCompressedOops for consistency between assembly and results):
  Benchmark                   Score   error  Units
  moduloLengthNoMask          3.448 ± 0.007  ns/op
  moduloConstantLengthNoMask  1.150 ± 0.002  ns/op
  moduloLengthMask            1.030 ± 0.006  ns/op
  moduloMask                  0.862 ± 0.001  ns/op
  consume                     0.719 ± 0.001  ns/op
  noop                        0.287 ± 0.000  ns/op

So pretty much as per expectation the modulo operation is far more expensive than the mask:
  • The clever JIT is aware of the optimisation opportunity and will replace a constant % with the &. It is not a perfect replacement, but pretty close.
  • At this sort of low digit ns benchmark we can’t make a statement such as “modulo is 4 times more expensive” because the same machine produces a baseline of 0.287ns/op for the noop benchmark and 0.719ns/op for the consume benchmark. If we deduct the consume result from the other scores we see a 1 : 25 ratio between the costs. Is that a good way to model performance? Not really either; performance is not additive, so simply subtracting one cost from the other doesn't really work at this scale. The truth is somewhere fuzzy in between, and if we really care we should look at the assembly.
  • It seems that using a pre-calculated mask field is more awesome than using the "array length - 1" as a mask. That is consistent with the expectation that the re-calculation of the mask on the fly, as well as loading the value to be used for that calculation, is more expensive than using the pre-calculated field.
I love it when a plan comes together...

Going Deeper

The reason we wanted the modulo in the first place was to read from the array, right? So let’s try that:
@Benchmark
public int readByLengthNoMask() {
    return ints[moduloLengthNoMask()];
}

@Benchmark
public int readByLengthMask() {
    return ints[moduloLengthMask()];
}

@Benchmark
public int readByConstantLengthNoMask() {
    return ints[moduloConstantLengthNoMask()];
}

@Benchmark
public int readByMask() {
    return ints[moduloMask()];
}

@Benchmark
public int readNoMask() {
    return ints[someIndex];
}

And the results:
  Benchmark                   Score   error  Units
  readByLengthNoMask          3.736 ± 0.005  ns/op
  readByConstantLengthNoMask  1.437 ± 0.001  ns/op
  readByMask                  1.347 ± 0.022  ns/op
  readByLengthMask            1.181 ± 0.049  ns/op
  readNoMask                  1.175 ± 0.004  ns/op
Well, what’s this I see? "length-1" mask is leading the chart! How’d that happen?
To quote from the famous “Jack and the FlumFlum Tree”:
“Don’t get your knickers in a twist!” said Jack,
“Let’s have a look in the patchwork sack.”
Let's start with the generated assembly for the constant modulo:
0x00000001107d45fc: mov r11d,DWORD PTR [rsi+0x18]
0x00000001107d4600: test r11d,r11d
0x00000001107d4603: jge 0x00000001107d462d
0x00000001107d4605: neg r11d; I be negative, un-negatize me
0x00000001107d4608: and r11d,0xf; mask me
0x00000001107d460c: neg r11d; turn me negative once more
Label DONE_MODULO:
0x00000001107d460f: mov r10,QWORD PTR [rsi+0x20] ;*getfield ints
0x00000001107d4613: mov r9d,DWORD PTR [r10+0x10] ; ints.length
0x00000001107d4617: cmp r11d,r9d ; bounds check
0x00000001107d461a: jae 0x00000001107d4633; IndexOutOfBounds
0x00000001107d461c: mov eax,DWORD PTR [r10+r11*4+0x18]; get ints[r11]
0x00000001107d4621: add rsp,0x20
0x00000001107d4625: pop rbp
0x00000001107d4626: test DWORD PTR [rip+0xfffffffffe5939d4],eax
0x00000001107d462c: ret
0x00000001107d462d: and r11d,0xf; hit me, I’m positive
0x00000001107d4631: jmp 0x00000001107d460f; goto DONE_MODULO
I didna see that one coming! The modulo on a constant is not your garden variety & mask affair, since it turns out our original assertion about the mask/modulo equality is only true for positive numbers. The JIT in its wisdom is dealing with the negative case by doing (x = -x; x = x & 15; x = -x;).
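In Java terms, the generated code for someIndex % 16 behaves roughly like this (my paraphrase of the assembly above, not code the JDK actually contains):
static int remainderOf16(int x) {
    if (x >= 0) {
        return x & 15;   // the common case: a plain mask
    }
    return -((-x) & 15); // negative input: negate, mask, negate back to keep the sign
}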
I think the above case could be made a tiny bit faster by switching the branch around (so jump for negative value). It’s easy however to see what happens if we simplify the constant version further by using a constant mask:
@Benchmark
public int moduloConstantLengthMask() {
    return someIndex & (LENGTH - 1);
}

@Benchmark
public int readByConstantLengthMask() {
    return ints[moduloConstantLengthMask()];
}

And results:
  Benchmark                   Score   error  Units
  moduloConstantLengthNoMask  1.150 ± 0.002  ns/op
  moduloConstantLengthMask    0.860 ± 0.001  ns/op
  readByConstantLengthNoMask  1.437 ± 0.001  ns/op
  readByConstantLengthMask    1.209 ± 0.017  ns/op
So minor joy on the modulo, and reading is better than plain mask, nearly as good as the "length-1" mask. Oh well, let's move on.
The big surprise was the mask calculated on the fly from the array length version. How can calculating the mask on the fly, which seemed to be slower, end up being faster when reading from the array? Who feels like more assembly?
0x000000010dbd5ee0: mov r9,QWORD PTR [rcx+0x20] ;*getfield ints
0x000000010dbd5ee4: mov r11d,DWORD PTR [r9+0x10] ; get length
0x000000010dbd5ee8: mov r10d,r11d
0x000000010dbd5eeb: dec r10d; m = length - 1
0x000000010dbd5eee: and r10d,DWORD PTR [rcx+0x18]; someIndex & m
0x000000010dbd5ef2: cmp r10d,r11d; bound check
0x000000010dbd5ef5: jae 0x000000010dbd5f4f; ArrayIndexOutOfBounds
0x000000010dbd5ef7: mov r11d,DWORD PTR [r9+r10*4+0x18];

I was hoping the JVM was clever enough to remove the array bound checks, but that didn’t happen. What’s happening here is that the length load serves the purpose of both creating the mask and checking the bounds. This is not the case for the mask version where we load the mask for the index calculation and the length for the bounds check, thus paying for 2 loads instead of one:
0x000000010e2de8a0: mov r10d,DWORD PTR [rcx+0x14]; someIndex
0x000000010e2de8a4: and r10d,DWORD PTR [rcx+0x18]; & mask
0x000000010e2de8a8: mov r9,QWORD PTR [rcx+0x20] ;*getfield ints
0x000000010e2de8ac: mov r11d,DWORD PTR [r9+0x10]; length
0x000000010e2de8b0: cmp r10d,r11d; bound check
0x000000010e2de8b3: jae 0x000000010e2de90d; ArrayIndexOutOfBounds
0x000000010e2de8b5: mov r11d,DWORD PTR [r9+r10*4+0x18];

So removing the computation did not make a difference because the bound check requires the extra load of the length anyhow. Can we make the bounds check go away? Of course we can, but it’s Unsafe!!! Let’s do it anyways!
private final static long ARRAY_OFFSET = UNSAFE.arrayBaseOffset(int[].class);
private final static int ELEMENT_SIZE = UNSAFE.arrayIndexScale(int[].class);

@Benchmark
public int unsafeReadByMask() {
    return UNSAFE.getInt(ints, ARRAY_OFFSET + ELEMENT_SIZE * moduloMask());
}

The assembly:
0x0000000111306078: mov r9,QWORD PTR [r11+0x20] ;*getfield ints
0x000000011130607c: mov r8d,DWORD PTR [r11+0x14] ; someIndex
0x0000000111306080: and r8d,DWORD PTR [r11+0x18] ; someIndex & m
0x0000000111306084: shl r8d,0x2; (someIndex & m) << 2
0x0000000111306088: movsxd r8,r8d; (long)((someIndex & m) << 2)
0x000000011130608b: mov r8d,DWORD PTR [r9+r8*1+0x18]

Shazzam! No bounds check, but look at all the work that’s gone into the unsafe read of the array. It would have been so much better if the unsafe read enjoyed the same addressing mode as normal array reads, like so “r8d,DWORD PTR [r9+r10*4+0x18]”, but it seems the JIT compiler is not recognising the opportunity here. What’s the performance like?
  Benchmark                   Score   error  Units
  readByMask                  1.347 ± 0.022  ns/op
  readByLengthMask            1.181 ± 0.049  ns/op
  readNoMask                  1.175 ± 0.004  ns/op
  unsafeReadByMask            1.152 ± 0.001  ns/op

This is even better than no mask at all. Yay?
Well… sort of. If you mean to have the fastest ‘get’ from the array that allows for an array size which is not an application constant, then this is a mini-win. In particular it saves you a load of the array length in this case, and loads can cost anything really. In the case where index and mask are long we can get better code generated:
0x00000001032dedb7: mov r10,QWORD PTR [r8+0x30] ;*getfield ints
0x00000001032dedbb: mov r11,QWORD PTR [r8+0x10] ; someIndexL
0x00000001032dedbf: and r11,QWORD PTR [r8+0x18] ; & maskL
0x00000001032dedc3: mov r11d,DWORD PTR [r10+r11*4+0x18]; good addressing
But performance is much the same for this case. Seems like there’s not much left to win here.
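The long-indexed variant is not shown in the post; presumably it looks something like the following (a sketch, the long typed index and mask fields are my assumption):
long someIndexL = 5;
long maskL = LENGTH - 1;

@Benchmark
public int unsafeReadByLongMask() {
    return UNSAFE.getInt(ints, ARRAY_OFFSET + ELEMENT_SIZE * (someIndexL & maskL));
}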
For completeness sake we can compare the no mask result with an Unsafe equivalent:
  Benchmark                   Score   error  Units
  unsafeReadByNoMask          1.038 ± 0.022  ns/op
  readNoMask                  1.175 ± 0.004  ns/op

So it seems slipping past the array boundary check is worth something, but is it generally worth it? What if we weren't dealing with just the one element?

Bound Check Elimination

Looking at the above optimisation we need to accept that it is probably only worth it if array bound checks happen on every access. If we now compare a sum over an array:
@Benchmark
public int loopOverArrayStraight() {
    int sum = 0;
    for (int i = 0; i < ints.length; i++) {
        sum += ints[i];
    }
    return sum;
}

@Benchmark
public int loopOverArrayUnsafeInt() {
    int sum = 0;
    // this is using the same 'get' logic as previously discussed unsafeReadNoMask
    for (int i = 0; i < ints.length; i++) {
        sum += UNSAFE.getInt(ints, ARRAY_OFFSET + ELEMENT_SIZE * i);
    }
    return sum;
}

@Benchmark
public int loopOverArrayUnsafeLong() {
    int sum = 0;
    // this is done in the hope of getting better addressing code avoiding going via an int
    for (long i = 0; i < ints.length; i++) {
        sum += UNSAFE.getInt(ints, ARRAY_OFFSET + ELEMENT_SIZE * i);
    }
    return sum;
}

We get the following results (length=100):
  Benchmark                    Score    error  Units
  loopOverArrayStraight        26.855 ± 0.060  ns/op
  loopOverArrayUnsafeInt       41.413 ± 0.056  ns/op
  loopOverArrayUnsafeLong      76.257 ± 0.171  ns/op
Oh Unsafe, why you so sucky sucky? How come the unsafe versions suck so significantly? Isn’t Unsafe the cure to all performance problems?
Once the bounds check is eliminated by the JIT we can see that for the UnsafeInt we have the same issue with addressing conversion, only now the cost is not compensated for by the bound check removal. The UnsafeLong version is even worse, how come?
The generated loop for the int case is long and boring because it’s unrolled, the long case is pretty small:
Label Start:
0x0000000108832730: lea r10,[r9+rdi*4];r9=ints
0x0000000108832734: add rdi,0x1; rdi=i
0x0000000108832738: add r11d,DWORD PTR [r10+0x18]; r11d=sum
0x000000010883273c: test DWORD PTR [rip+0xfffffffffe5c48be],eax; {poll} -> Safepoint poll
0x0000000108832742: cmp rdi,rcx; rcx=array.length
0x0000000108832745: jl 0x0000000108832730; if(i<length) goto Start

2 'bad' things just happened:
  1. Addressing didn’t work out the way we’d like. Instead of the desired “mov    r11d,DWORD PTR [r9+rdi*4+0x18]” we get a two stage setup where we do “lea    r10,[r9+rdi*4]” and then “add    r11d,DWORD PTR [r10+0x18]”. Bummer.
  2. We got a safe point poll in the loop. This is happening because long indexed loops are considered potentially very long (as opposed to shorter int loops... heuristics for time to safe point) and so include a safe point poll.
So we want to fix the addressing mode and stick to having an int index. If we were to insist on using Unsafe (perhaps because we are trying to do this with off heap memory instead of an array) we’d have to do this:
@Benchmark
public int loopOverArrayUnsafePointer() {
    int sum = 0;
    long offset = ARRAY_OFFSET;             // this is our relative "pointer" into the array, int*
    for (int i = 0; i < ints.length; i++) { // use an int counter, no safe point poll
        sum += UNSAFE.getInt(ints, offset); // offset is i * ELEMENT_SIZE
        offset += ELEMENT_SIZE;             // move pointer by one element
    }
    return sum;
}

[4/11/2014 NOTE]  Note that what we really want here is more than just getting rid of the multiplication/widening, we want the JIT to identify the expression calculated for offset as relative array access and pick the correct addressing mode for MOV to use. There are clever people out there trying to make sure this will work better in the future.
This removes the need for a safe point poll and simplifies addressing to the point where we nearly match the iteration over the array case (length=100):
  Benchmark                    Score    error  Units
  loopOverArrayStraight        26.855 ± 0.060  ns/op
  loopOverArrayUnsafePointer   27.377 ± 0.049  ns/op
We can explore the relationship between the implementations by testing for different array sizes:
  Benchmark (ns/op)     10     100     1000     10000
  straight             4.3    26.8    289.2    2883.7
  unsafeP              4.8    27.3    296.1    2886.4

So it seems that the smaller the array the more relative advantage the array iteration has when iterating in this fashion. This should not really be surprising, there's nothing here to confuse the JIT compiler and iterating over arrays is important enough to optimize. We have to work hard to get close to the JIT compiler when it does what it does best.


Summary

We had a simple optimisation in mind, replace a % with &:
  • Observed that for the case where constants are used the JIT is able to perform that optimisation for us almost as well as we’d do ourselves (we have no way of specifying positive only modulo, i.e uint).
  • We proved the viability of the optimisation in 2 variations, using a pre-calculated mask field and using (array.length - 1)
  • Using the optimisation in the context of a circular array read showed an interesting reversal in performance. We observed the cause of this reversal to be that the array.length load required for the bounds check is reused for the mask calculated on the fly, whereas the pre-calculated mask version pays for an extra load.
  • Using Unsafe we managed to bypass the array bound check and get the best result using the mask for a single read. 
  • When we try the same method naively in a loop (over the whole array) array bound check is eliminated and plain old array access is the best performer.
  • To regain the performance for Unsafe access we have to tweak the code to avoid safe point polling as well as to get the addressing mode we want in the resulting assembly. Even then plain array access is better for smaller arrays.
Simple innit?
Some notes on methodology:
  • I ran the same experiments on different intel generations, you get different results but the assembly remains the same. E.g. on older CPUs the maximum instructions per cycle would be less than on the Ivy Bridge CPU I've used here, this will lead to instruction spilling over to the next cycle. The L1 latency could be higher leading to loads dominating the costs etc. This ends up giving a slightly different balance to compute vs. memory load. Overall analysis holds.
  • Using -XX:-UseCompressedOops was done for the sake of consistent assembly and results. Using compressed oops makes loads look a bit clumsier and I wanted to have less to explain. But running with the flag on (as it is by default) also affects results at this scale. In particular, because compressed oops require a shift to be used and shifters are a limited resource on the CPU (1 on Westmere, 2 on Ivy Bridge), it can end up adding a cycle to the results.
  • Running these same experiments on a laptop was good for getting the assembly out and a vague sense of scale for results, but measurements had far greater error in that environment. Also note that laptops and desktops tend to be a generation ahead of servers where processors are concerned.
  • An interesting experiment would be to look at same experiment with the JMH perfasm profiler. I did that but could not figure out how to get Intel syntax out of it and so for consistency sake stuck with what I had. Left as an exercise to the reader :P
Many thanks to J.P. Bempel and Peter Hughes for reviewing, any issues remaining were added by me after they reviewed the post.

Wednesday, 29 October 2014

The JVM Write Barrier - Card Marking

In Java, not all value stores are created equal, in particular storing object references is different to storing primitive values. This makes perfect sense when we consider that the JVM is a magical place where object allocation, relocation and deletion are somebody else's problem. So while in theory writing a reference field is the same as writing the same sized primitive (an int on 32bit JVMs or with compressed oops on, or a long otherwise) in practice some accounting takes place to support GC. In this post we'll have a look at one such accounting overhead, the write barrier.

What's an OOP?

An OOP (Ordinary Object Pointer) is the way the JVM views Java object references. They are pointer representations rather than actual pointers (though they may be usable pointers). Since objects are managed memory, OOP reads/writes may require a memory barrier of the memory management kind (as opposed to the JMM ordering barrier kind):
"A barrier is a block on reading from or writing to certain memory locations by certain threads or processes.
Barriers can be implemented in either software or hardware. Software barriers involve additional instructions around load or store operations, which would typically be added by a cooperative compiler. Hardware barriers don’t require compiler support, and may be implemented on common operating systems by using memory protection." 
- Memory Management Reference, Memory Barrier
"Write barriers are used for incremental or concurrent garbage collection. They are also used to maintain remembered sets for generational collectors." 
- Memory Management Reference, Write Barrier
 In particular this means card marking.

Card Marking

All modern JVMs support a generational GC process, which works under the assumption that allocated objects mostly live short and careless lives. This assumption leads to a GC algorithm where different generations are treated differently, and where cross generation references pose a challenge. Now imagine the time to collect the young generation is upon our JVM: what do we need to do to determine which young objects are still alive (ignoring the Phantom/Weak/Soft reference debate and finalizers)?
  • An object is alive if it is referenced by a live object.
  • An object is alive if a static reference to it exists (part of the root set).
  • An object is alive if a stack reference to it exists (part of the root set).
The GC process therefore:
"Tracing garbage collectors, such as copying, mark-sweep, and mark-compact, all start scanning from the root set, traversing references between objects, until all live objects have been visited.
A generational tracing collector starts from the root set, but does not traverse references that lead to objects in the older generation, which reduces the size of the object graph to be traced. But this creates a problem -- what if an object in the older generation references a younger object, which is not reachable through any other chain of references from a root?" - Brian Goetz, GC in the HotSpot JVM
Illustration By Alexey Ragozin
It is worth reading the whole article to get more context on the cross generational reference problem, but the solution is card marking:
"...the heap is divided into a set of cards, each of which is usually smaller than a memory page. The JVM maintains a card map, with one bit (or byte, in some implementations) corresponding to each card in the heap. Each time a pointer field in an object in the heap is modified, the corresponding bit in the card map for that card is set."
A good explanation of card marking is also given here by Alexey Ragozin. I have taken liberty to include his great illustration of the process.
So there you have it, every time an object reference is updated the compiler has to inject some accounting logic towards card marking. So how does this affect the code generated for your methods?

Default Card Marking

OpenJDK/Oracle 1.6/1.7/1.8 JVMs default to the following card marking logic (assembly for a setter such as setFoo(Object bar) ):
; rsi is 'this' address
; rdx is setter param, reference to bar
; JDK6:
mov QWORD PTR [rsi+0x20],rdx ; this.foo = bar
mov r10,rsi ; r10 = rsi = this
shr r10,0x9 ; r10 = r10 >> 9;
mov r11,0x7ebdfcff7f00 ; r11 is base of card table, imagine byte[] CARD_TABLE
mov BYTE PTR [r11+r10*1],0x0 ; Mark 'this' card as dirty, CARD_TABLE[this address >> 9] = 0
; JDK7(same):
mov QWORD PTR [rsi+0x20],rdx
mov r10,rsi
shr r10,0x9
mov r11,0x7f6d852d7000
mov BYTE PTR [r11+r10*1],0x0
; JDK8:
mov QWORD PTR [rsi+0x20],rdx
shr rsi,0x9 ; clever JIT noticed RSI is not in use later, so shifting in place
mov rdi,0x7f2d42817000
mov BYTE PTR [rsi+rdi*1],0x0

So setting a reference throws in the overhead of a few instructions, which boil down to:
CARD_TABLE [this address >> 9] = 0;
This is significant overhead when compared to primitive fields, but is considered necessary tax for memory management. The tradeoff here is between the benefit of card marking (limiting the scope of required old generation scanning on young generation collection) vs. the fixed operation overhead for all reference writes. The associated write to memory for card marking can sometimes cause performance issues for highly concurrent code. This is why in OpenJDK7 we have a new option called UseCondCardMark.
[UPDATE: as JP points out in the comments, the (>> 9) is converting the address to the relevant card index. Cards are 512 bytes in size so the shift is in fact address/512 to find the card index. ]
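In Java-flavoured pseudo code the default barrier therefore amounts to something like this (illustrative only, not real JVM code):
// one byte per 512 byte card, so (address >>> 9) == address / 512 is the card index
void postReferenceWriteBarrier(byte[] cardTable, long thisAddress) {
    cardTable[(int) (thisAddress >>> 9)] = 0; // unconditionally mark the card dirty
}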

Conditional Card Marking

This is the same code run with -XX:+UseCondCardMark:
; rsi is 'this' address
; rdx is setter param, reference to bar
; JDK7:
0x00007fc4a1071d5c: mov r10,rsi ; r10 = this
0x00007fc4a1071d5f: shr r10,0x9 ; r10 = r10 >> 9
0x00007fc4a1071d63: mov r11,0x7f7cb98f7000 ; r11 = CARD_TABLE
0x00007fc4a1071d6d: add r11,r10 ; r11 = CARD_TABLE + (this >> 9)
0x00007fc4a1071d70: movsx r8d,BYTE PTR [r11] ; r8d = CARD_TABLE[this >> 9]
0x00007fc4a1071d74: test r8d,r8d
0x00007fc4a1071d77: je 0x00007fc4a1071d7d ; if(CARD_TABLE[this >> 9] == 0) goto 0x00007fc4a1071d7d
0x00007fc4a1071d79: mov BYTE PTR [r11],0x0 ; CARD_TABLE[this >> 9] = 0
0x00007fc4a1071d7d: mov QWORD PTR [rsi+0x20],rdx ; this.foo = bar

Which boils down to:
if (CARD_TABLE [this address >> 9] != 0) CARD_TABLE [this address >> 9] = 0; 
This is a bit more work, but avoids the potentially concurrent writes to the card table, thus sidestepping some potential false sharing through minimising recurring writes. I have been unable to make JDK8 generate similar code with the same flag regardless of which GC algorithm I run with (I can see the code in the OJDK codebase... not sure what the issue is, feedback/suggestions/corrections welcome).

Card Marking G1GC style?

Is complicated... have a look:
movsx edi,BYTE PTR [r15+0x2d0] ; read some GC flag
cmp edi,0x0; if (flag != 0)
jne 0x00000001066fc601; GOTO OldValBarrier
Label WRITE:
mov QWORD PTR [rsi+0x20],rdx; this.foo = bar
mov rdi,rsi; rdi = this
xor rdi,rdx; rdi = this XOR bar
shr rdi,0x14; rdi = (this XOR bar) >> 20
cmp rdi,0x0; If this and bar are not same gen
jne 0x00000001066fc616; GOTO NewValBarrier
Label EXIT:
;…
Label OldValBarrier:
mov rdi,QWORD PTR [rsi+0x20]
cmp rdi,0x0; if(this.foo == null)
je 0x00000001066fc5dd; GOTO WRITE
mov QWORD PTR [rsp],rdi ; setup rdi as parameter
call 0x000000010664bca0 ; {runtime_call}
jmp 0x00000001066fc5dd; GOTO WRITE
Label NewValBarrier:
cmp rdx,0x0; bar == null
je 0x00000001066fc5f5 goto Exit
mov QWORD PTR [rsp],rsi
call 0x000000010664bda0 ; {runtime_call}
jmp 0x00000001066fc5f5 ; GOTO exit;
To figure out exactly what this was about I had to have a read in the Hotspot codebase. A rough translation would be:
oop oldFooVal = this.foo;
if (GC.isMarking != 0 && oldFooVal != null) {
    g1_wb_pre(oldFooVal);
}
this.foo = bar;
if (((this ^ bar) >> 20) != 0 && bar != null) {
    g1_wb_post(this);
}
 The runtime calls are an extra overhead whenever we  are unlucky enough to either:
  1. Write a reference while card marking is in process (and old value was not null)
  2. Target object is 'older' than new value (and new value is not null) [UPDATE: (SRC^TGT>>20 != 0) is a cross region rather than a cross generational check. Thanks Gleb!]
The interesting point to me is that the generated assembly ends up being somewhat fatter (nothing like your mamma) and has a significantly worse 'cold' case (cold as in less likely to happen), so in theory mixing up the generations will be painful.

Summary

Writing references incurs some overhead not present for primitive values. The overhead is in the order of a few instructions which is significant when compared to primitive types, but minor when we assume most applications read more than they write and have a healthy data/object ratio. Estimating the card marking impact is non-trivial and I will be looking to benchmark it in a later post. For now I hope the above helps you recognise card marking logic in your print assembly output and sheds some light on what the write barrier and card marking is about.


Tuesday, 28 October 2014

Celebrating 2 years of blogging!

2 years ago I started on this blog with a short and relatively under-exciting post about intrinsics. I was not happy with that first post. But you have to start somewhere I guess ;-). I set myself a target of writing 2 posts a month and pretty much kept to it (43 posts and 1 page). Some posts took huge investment, some less, I learnt something new while writing every one of them.
I spent last week at Joker Conf and Gee Con; I don't think I'd have been invited to speak at either were it not for my blog. I'm also pretty sure I owe my current job (and other job offers) to the blog. I'm still surprised to meet people who read it. Most seem happy. It proved to be a lot of work, but just the sort of excuse I needed to dig deeper into corners of Java and concurrency I find exciting. Some of the effort that went into the blog became the groundwork for JCTools. I guess what I'm trying to say is that it worked out very well for me, both in driving my learning process and gaining me some recognition that led to rewarding experiences and opportunities. Also, some other people seem to enjoy it :-)
The name of the blog proved puzzling for many (not a big surprise really), so in case you're still wondering where it came from, here's the relevant strip from Calvin & Hobbes:

I am a huge Bill Watterson fan, you should buy yourself the full C&H set, it will prove a more lasting reading material than any performance/Java/programming book you own. Also, I've seen many performance related discussions go a similar way to the above exchange...
A huge thank you to the readers, commenters and reviewers, urging me this way and steering me that way. Let's see if I can keep it up another 2 years :-)

Wednesday, 27 August 2014

Disassembling a JMH Nano-Benchmark

{UPDATE 03/09/14: If you come here looking for JMH related content start at the new and improved JMH Resources Page and branch out from there!}
I often feel it is nano-benchmarks that give microbenchmarks a bad name (that and the fact MBMs tend to sell crack and their young bodies). Putting to one side the latter issue for bleeding heart liberalists to solve, we are left with the former. In this post I'd like to help the budding nano-benchmark writer resolve and investigate the embarrassing dilemma of: "What just happened?"
"What just happened?" is a question you should almost always ask yourself when running a nano-benchmark. The chances of the compiler finding out your benchmark does nothing, or that significant part of your benchmark can be omitted, are surprisingly large. This is partly a case of extreme cleverness of compiler writers and partly the simplicity of the benchmark code potentially leaving the door open to optimisations perhaps not possible in the wild. The best way to answer the question is to have a look at the assembly end result of your benchmark code.
Hipster developer that I am, I use JMH to write microbenchmarks. Chances are you should too if you are writing nano/micro benchmarks as it goes a long way toward solving common issues. In the rest of this post we'll be looking at the assembly produced by JMH benchmarks and explaining away the framework so that you can more easily find your way in your own benchmark.

The NOOP benchmark

I started with the observation that nano-benchmarks sometimes get optimized away; if they did, they'd have the same end result as this benchmark:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class BaselineBenchmarks {
    @Benchmark
    public void noop() {
    }
}
Exciting stuff! So we measure nothing at all. How are we measuring this? JMH generates some code around a call to the above method that will do the measurement:
public void noop_avgt_jmhLoop(InfraControl control, RawResults result,
                              BaselineBenchmarks_1_jmh benchmark,
                              Blackhole_1_jmh l_blackhole1_1) throws Throwable {
    long operations = 0;
    long realTime = 0;
    result.startTime = System.nanoTime();
    do {
        benchmark.noop(); // <-- the original noop()
        operations++;
    } while (!control.isDone);
    result.stopTime = System.nanoTime();
    result.realTime = realTime;
    result.operations = operations;
}
So we have a while loop, spinning on the isDone flag and counting how many times we can manage to execute it until someone tells us to stop (by setting the isDone flag to true). It follows therefore that the measurement overhead is:
  • Reading the volatile field isDone (L1 hitting read, predictable)
  • Incrementing a counter (on the stack)
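Once the loop exits, the average time per operation falls out of the recorded fields roughly as follows (simplified; the real JMH reporting aggregates across iterations and forks):
double avgNanosPerOp = (result.stopTime - result.startTime) / (double) result.operations;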
But healthy skepticism is what this is all about, let's see what the generated assembly looks like! I'll be gentle, assembly is often hard on the eyes.

Getting The Assembly Output

To try this at home you'll need a drink, a JVM set up to print assembly, and the sample code. Build the project with Maven, then run the benchmark and generate the assembly using the following command:
$JAVA_HOME/bin/java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*.noop_avgt_jmhLoop -XX:PrintAssemblyOptions=intel -XX:-UseCompressedOops -jar target/microbenchmarks.jar -i 5 -wi 5 -f 0 ".*.noop" > noop.ass
I'm only printing the measurement method, using the Intel syntax instead of the default AT&T and disabling compressed oops to get simpler output for this particular exercise. The output will contain several versions of the compiled method, I will be discussing the final version which is the last in the output.
Now that we've got the assembly printed we can get familiar with the structure of the JMH measurement loop as it is translated into assembly:
Decoding compiled method 0x00007fe5a106bb90:
Code:
[Entry Point]
[Constants]
# {method} 'noop_avgt_jmhLoop' '(Lorg/openjdk/jmh/runner/InfraControl;Lorg/openjdk/jmh/results/RawResults;Lpsy/lob/saw/generated/BaselineBenchmarks_noop$BaselineBenchmarks_1_jmh;Lpsy/lob/saw/generate
d/BaselineBenchmarks_noop$Blackhole_1_jmh;)V' in 'psy/lob/saw/generated/BaselineBenchmarks_noop'
# this: rsi:rsi = 'psy/lob/saw/generated/BaselineBenchmarks_noop'
# parm0: rdx:rdx = 'org/openjdk/jmh/runner/InfraControl'
# parm1: rcx:rcx = 'org/openjdk/jmh/results/RawResults'
# parm2: r8:r8 = 'psy/lob/saw/generated/BaselineBenchmarks_noop$BaselineBenchmarks_1_jmh'
# parm3: r9:r9 = 'psy/lob/saw/generated/BaselineBenchmarks_noop$Blackhole_1_jmh'
# [sp+0x20] (sp of caller)
0x00007fe5a106bce0: cmp rax,QWORD PTR [rsi+0x8]
0x00007fe5a106bce4: jne 0x00007fe5a1037960 ; {runtime_call}
0x00007fe5a106bcea: xchg ax,ax ; [NW] NOP
0x00007fe5a106bcec: nop DWORD PTR [rax+0x0] ; [NW] NOP, Align the function body
[Verified Entry Point]
0x00007fe5a106bcf0: mov DWORD PTR [rsp-0x14000],eax
0x00007fe5a106bcf7: push rbp
0x00007fe5a106bcf8: sub rsp,0x10 ;*synchronization entry
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@-1 (line 156)
0x00007fe5a106bcfc: mov r13,rdx ; [NW] control ref is now in R13
0x00007fe5a106bcff: mov rbp,r8 ; [NW] benchmark ref is now in RBP
0x00007fe5a106bd02: mov rbx,rcx ; [NW] result is now in RBX

This is just the preliminaries for the method, so not much to see except noting which reference is in which register to help interpret the rest of the code. The comments in the printout are generated by the JVM, my comments are prefixed with [NW].
Once all the pieces are in place we can move on to some actual work.

Measurement Loop: 2 Timestamps diverged in a yellow wood

Refresh your memory of what the Java code above does and let's see if we can see it here:
0x00007fe5a106bd05: mov r10,0x7fe5abd38f20 ; [NW] Setup the System.nanoTime() address for call
0x00007fe5a106bd0f: call r10 ;*invokestatic nanoTime
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@7 (line 158)
; [NW] RBX is result, RBX+0x20 is result.startTime
0x00007fe5a106bd12: mov QWORD PTR [rbx+0x20],rax ;*putfield startTime
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@10 (line 158)
; implicit exception: dispatches to 0x00007fe5a106bd81
0x00007fe5a106bd16: mov r11,rbp ; [NW] R11 = RBP which is l_baselinebenchmarks0_0
; [NW] the following 2 lines mean: "if (l_baselinebenchmarks0_0 == null) throw new NullPointerException();"
0x00007fe5a106bd19: test r11,r11 ; [NW] R11 & R11
0x00007fe5a106bd1c: je 0x00007fe5a106bd70 ;*invokevirtual noop
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@14 (line 160)
; [NW] R13 is control
0x00007fe5a106bd1e: movzx r10d,BYTE PTR [r13+0x9c] ;*getfield isDone
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@24 (line 162)
; implicit exception: dispatches to 0x00007fe5a106bd95
; [NW] EBP is the lower half of RBP, set it to 1
0x00007fe5a106bd26: mov ebp,0x1
; [NW] following 2 lines mean: "if (isDone == true) goto FINISH;"
0x00007fe5a106bd2b: test r10d,r10d
0x00007fe5a106bd2e: jne 0x00007fe5a106bd47 ;*aload_3
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@13 (line 160)
; [NW] LOOP START:
0x00007fe5a106bd30: movzx r10d,BYTE PTR [r13+0x9c] ;*getfield isDone
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@24 (line 162)
; [NW] RBP is operations, so do "operations++;"
0x00007fe5a106bd38: add rbp,0x1 ; OopMap{r11=Oop rbx=Oop r13=Oop off=92}
;*ifeq
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@27 (line 162)
; [NW] Safepoint!
0x00007fe5a106bd3c: test DWORD PTR [rip+0xb5482be],eax # 0x00007fe5ac5b4000
; {poll}
; [NW] following 2 lines mean: "if (isDone != true) goto LOOP START;"
0x00007fe5a106bd42: test r10d,r10d
0x00007fe5a106bd45: je 0x00007fe5a106bd30 ;*aload_2
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@30 (line 163)
; [NW] FINISH : Finished measuring
0x00007fe5a106bd47: mov r10,0x7fe5abd38f20 ; [NW] Setup the System.nanoTime() address for call
0x00007fe5a106bd51: call r10 ;*invokestatic nanoTime
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@31 (line 163)
; [NW] Populate the result fields
0x00007fe5a106bd54: mov QWORD PTR [rbx+0x10],rbp ;*putfield operations
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@46 (line 165)
0x00007fe5a106bd58: mov QWORD PTR [rbx+0x28],rax ;*putfield stopTime
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@34 (line 163)
; [NW] Note that realTime variable is gone, we just set the field to 0
0x00007fe5a106bd5c: mov QWORD PTR [rbx+0x18],0x0 ;*putfield realTime
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@40 (line 164)
; [NW] method wrap up mechanics, including another safepoint before returning
0x00007fe5a106bd64: add rsp,0x10
0x00007fe5a106bd68: pop rbp
0x00007fe5a106bd69: test DWORD PTR [rip+0xb548291],eax # 0x00007fe5ac5b4000
; {poll_return}
0x00007fe5a106bd6f: ret
Have a sip and scan slowly. Here are some nuggets to consider:

  • As expected the noop() method is not called and any mention of it is gone from the measurement loop.
  • The first iteration of the loop has been 'peeled', this is common practice.
  • Even though we never call noop(), we still have to do the null check for the benchmark reference.
  • The sharp-eyed reader will have noticed the redundant realTime variable in the generated measurement loop; so has the JIT compiler, and it has been replaced with setting the result.realTime field directly to 0.
  • RBP is an 8 byte register, EBP is the lower half of the same register. Setting EBP to 1 in the peeled first iteration is the same as setting RBP to 1.
  • The measurement loop includes a safepoint! put that down as further measurement overhead.
This is the simplest benchmark one can write with JMH. On my test machine (an Intel Xeon E5-2697 v2 @ 2.70GHz) doing nothing is quite fast at 0.288 ns/op.
As you may have expected, reading the generated assembly is not so pleasant. I find the generated comments very helpful for orientation, and the timestamp calls on either side of the measurement loop help in zooming in on the important bits.

A Nano-Benchmark: i++

Nothing says "nano-benchmark" like benchmarking a single operation. Let's have a go at it!
int i;

@Benchmark
public void increment() {
    i++;
}
The generated loop is the same, but this time that crafty old JIT compiler cannot just do nothing with our code. We will finally learn the true cost of incrementing an integer! Given the overhead includes a long increment already I might even guess the cost at 0.25 ns/op, so maybe the result reported by JMH will be 0.5 ns/op? A warm fuzzy feeling of wisdom.
But when I run this benchmark on the same machine I learn to my dismay that incrementing an integer takes 1.794 ns/op according to my JMH benchmark. Damn integers! why does the JVM torture us so with slow integer increments?
This is a silly benchmark, and the result makes absolutely no sense as an estimate of the cost of the ++ operator on integers. So what does it mean? Could it be that the JIT compiler failed us? Let's have a look at the assembly:
; [NW] START: R8 is benchmark, [r8+0x10] is benchmark.i
0x00007f1d35068640: inc DWORD PTR [r8+0x10] ;*invokevirtual increment
; - psy.lob.saw.generated.BaselineBenchmarks_increment::increment_avgt_jmhLoop@14 (line 160)
; [NW] R13 is control
0x00007f1d35068644: movzx r10d,BYTE PTR [r13+0x9c] ;*getfield isDone
; - psy.lob.saw.generated.BaselineBenchmarks_increment::increment_avgt_jmhLoop@24 (line 162)
; [NW] operations++
0x00007f1d3506864c: add rbp,0x1 ; OopMap{r8=Oop rbx=Oop r13=Oop off=112}
;*ifeq
; - psy.lob.saw.generated.BaselineBenchmarks_increment::increment_avgt_jmhLoop@27 (line 162)
; [NW] Safepoint
0x00007f1d35068650: test DWORD PTR [rip+0xb5119aa],eax # 0x00007f1d4057a000
; {poll}
; [NW] if(!isDone) goto START
0x00007f1d35068656: test r10d,r10d
0x00007f1d35068659: je 0x00007f1d35068640 ;*aload_2
; - psy.lob.saw.generated.BaselineBenchmarks_increment::increment_avgt_jmhLoop@30 (line 163)
So why is the reported cost so much higher than our expectation?

What just happened?

My increment method got translated perfectly into: "inc DWORD PTR [r8+0x10]". There is no compiler issue.  The comparison I made between incrementing the operations counter and incrementing the benchmark field is flawed/misguided/stupid/ignorant when taking into account the benchmark framework.
The context in which we increment operations is:
  • It's a long variable allocated on the stack
  • It's used in a very small method where there is no register pressure
  • It follows that operations is always a register
  • ADD/INC on a register cost very little (it's the cheapest thing you can do usually)
The context in which we increment benchmark.i is:
  • It's a field on the benchmark object
  • It's subject to happens-before rules so cannot be hoisted into a register inside the measurement loop (because control.isDone is a volatile read, see this post for more detail and the sketch after this list)
  • It follows that benchmark.i is always a memory location
  • INC on a memory location is not so cheap (by nano benchmark standards)
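A minimal sketch of the same effect outside JMH, assuming a plain int field and a volatile flag as described above:
class HoistingExample {
    int i;                   // plain field, like benchmark.i
    volatile boolean isDone; // like control.isDone

    void loop() {
        long operations = 0; // local variable: lives in a register, increments are cheap
        while (!isDone) {    // a volatile read on every iteration...
            i++;             // ...so i cannot be cached in a register across iterations;
            operations++;    //    each i++ is a read-modify-write against memory
        }
    }
}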
Consulting with the most excellent Agner Fog instructions tables tells me that for Ivy Bridge the latency for INC on memory is 6 cycles, while the latency on ADD for a register is 1. This indeed agrees to some extent with the cost reported by JMH (assuming 0.288 was for one cycle, 0.288 * 6 = 1.728 which is pretty close to 1.794).  But that's bad analysis. The truth is that cost is not additive, particularly when nano-benchmarks are concerned. In this case the cost of the INC seems to swallow up the baseline cost we measured before. 
Is there something wrong with JMH? I don't think so. If we take the benchmark to be "an attempt at estimating the cost of calling a method which increments a field" then I would argue we got a valid answer. It's not the only answer however. Calling the same method in a context which allows further optimizations would yield a different answer.


Tuesday, 12 August 2014

The volatile read surprise

{UPDATE 03/09/14: If you come here looking for JMH related content start at the new and improved JMH Resources Page and branch out from there!}
On occasion, and for perfectly good reasons, I find myself trying to answer such deep existential questions as this one. Which is faster:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class LoopyBenchmarks {
    @Param({ "32", "1024", "32768" })
    int size;

    byte[] bunn;

    @Setup
    public void prepare() {
        bunn = new byte[size];
    }

    @Benchmark
    public void goodOldLoop(Blackhole fox) {
        for (int y = 0; y < bunn.length; y++) { // good old C style for (the win?)
            fox.consume(bunn[y]);
        }
    }

    @Benchmark
    public void sweetLoop(Blackhole fox) {
        for (byte bunny : bunn) { // syntactic sugar loop goodness
            fox.consume(bunny);
        }
    }
}

As you can see from the sample I turn to JMH to help me resolve such questions. If you know not what JMH is you may enjoy reading previous posts on the subject (start with this one). In short it is a jolly awesome framework for benchmarking java:
  • @Benchmark annotated methods will get benchmarked
  • The framework will pass in a Blackhole object that will pretend to 'consume' the values you pass into it and thus prevent the JIT compiler from dead code eliminating the above loops to nothing.
Assuming we are all on the same page with this snippet above, let the game begin!

Yummy yummy sugar!

So I ran the above benchmarks on some heavy duty benchmarking machine and got the following results for different array sizes:
Benchmark (size) Score Score error Units
goodOldLoop 32 46.630 0.097 ns/op
goodOldLoop 1024 1199.338 0.705 ns/op
goodOldLoop 32768 37813.600 56.081 ns/op
sweetLoop 32 19.304 0.010 ns/op
sweetLoop 1024 475.141 1.227 ns/op
sweetLoop 32768 14295.800 36.071 ns/op
It sure looks like the syntactic sugar is much better! More than twice as fast! Awesome?

Must give us pause

At this point we could either:
  1. Declare syntactic sugar the clear winner and never write the old style for loops ever again 'cause they be slow like everything old! we hates them old loops! hates them!
  2. Worry that we are being a bit stupid
I get very little sleep and I was never very bright, so I'll go for 2. 
This benchmark result seems off, it's not what we expect. It would make sense for the JVM to make both loops the same, and yet they seem to work out very differently. Why, god? whhhhhhhy?
The above benchmark is a tiny piece of code, and is a fine example of a nano-benchmark (to use the term coined by Shipilev for benchmarks of nano-second scale). These are pretty suspect benchmarks at the best of times, so you want to be quite alert when trying to make sense of them. When stuff doesn't make sense it is best to see what the JIT compiler made of your code and hit the assembly! Printing the JIT generated assembly is a neat party trick (sure to win you new friends and free drinks) and results in loads of funky text getting thrown at you. I was going to do a whole walk through the assembly but I have promises to keep and miles to walk before I sleep (some other time, I promise). So let's just skip to the WTF moment.

Into the hole

The assembly code for the goodOldLoop is long and painful to read through, and that in itself is a clue. Once you work out the control flow you'll sit there scratching your head and wondering. The thing that stands out (when the assembly smoke clears) is that bunn is loaded on every iteration, bunn.length is loaded and an array boundary check happens. This is surely a terrible way to interpret a for loop... 
The culprit turns out to be a volatile read in Blackhole.consume:
//...
public volatile byte b1, b2;
public volatile BlackholeL2 nullBait = null;

/**
 * Consume object. This call provides a side effect preventing JIT to eliminate dependent computations.
 *
 * @param b object to consume.
 */
public final void consume(byte b) {
    if (b == b1 & b == b2) {
        // SHOULD NEVER HAPPEN
        nullBait.b1 = b; // implicit null pointer exception
    }
}

The above method ensures that a consumed value will not be subject to DCE even if it is completely predictable. The values of b1 and b2, being volatile, cannot be assumed to stay the same and so require re-examination. The side effect, however, is that we now have a volatile load in the midst of our for loop. A volatile load prevents the loads that follow it from being hoisted above it (to preserve happens-before), so the field bunn is re-loaded on every iteration of the loop. And if bunn may have changed, then its length may have also changed... sadness follows. To test this theory we can make a third loop:
@Benchmark
public void goodOldLoopReturns(Blackhole fox) {
    byte[] sunn = bunn; // make a local copy of the field
    for (int y = 0; y < sunn.length; y++) {
        fox.consume(sunn[y]);
    }
}

This performs much like the sweet syntactic sugar version:
Benchmark (size) Score Score error Units
goodOldLoopReturns 32 19.306 0.045 ns/op
goodOldLoopReturns 1024 476.493 1.190 ns/op
goodOldLoopReturns 32768 14292.286 16.046 ns/op
sweetLoop 32 19.304 0.010 ns/op
sweetLoop 1024 475.141 1.227 ns/op
sweetLoop 32768 14295.800 36.071 ns/op

Lessons learnt?

  • Nano benchmarks and their results are hard to interpret. When in doubt read the assembly; when not in doubt, smack yourself to regain doubt and read the assembly. It's very easy for a phenomenon you are not looking to benchmark to slip into the benchmark.
  • Sugar is not necessarily bad for you. In the above case the syntactic sugar interpretation by the JVM was a better match to our intuition than the explicit old school loop. By being explicit we inhibited optimisation, despite intending the same thing. The enhanced for loop, as the JLS calls it, is semantically different from the basic for loop in that it assumes some sort of snapshot iterator taken at the beginning of the loop and used throughout, which for primitive arrays means taking the form used in goodOldLoopReturns.
  • Blackhole.consume also acts as a memory barrier, and these come with some side effects you may not expect. In larger benchmarks these may be negligible, but in nano benchmarks every little thing counts. This is a fine use case for a 'weak' volatile read, one which requires a memory read but no memory barrier (see the previous post on the compound meaning of volatile access).
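As an aside from a later vantage point: VarHandles (JDK 9+, so well after this post was written) expose pretty much that weaker read via getOpaque, which forces a real memory read without the ordering effects of a full volatile read. A hedged sketch, names illustrative:
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class WeakReadFlag {
    int flag; // plain field, read via the VarHandle below

    static final VarHandle FLAG;
    static {
        try {
            FLAG = MethodHandles.lookup().findVarHandle(WeakReadFlag.class, "flag", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    int readOpaque() {
        // guaranteed to actually read memory (no hoisting out of loops),
        // but does not act as an acquire barrier the way a volatile read does
        return (int) FLAG.getOpaque(this);
    }
}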

Friday, 8 August 2014

The many meanings of volatile read and write

Just a quick note on the topic as I find I keep having this conversation. Volatile fields in Java provide three distinct features:
  1. Atomicity: volatile long and double fields are guaranteed to be written atomically. This is not otherwise the case for long and double. See JLS section 17.7 for more details. Also see this excellent argument made by Shipilev on why all fields could be made atomic with no significant downside.
  2. Store/Load to/from memory: a normal field load may get hoisted out of a loop and be done once, a volatile field is prevented from being optimized that way and will be loaded on each iteration. Similarly stores are to memory and will not be optimized.
  3. Global Ordering: A volatile write acts as a StoreLoad barrier thus preventing previous stores from being reordered with following loads. A volatile read acts as a LoadLoad barrier and prevents following loads from happening before it. This is opposed to the meaning of volatile in C/C++ where only other volatile loads/stores are prevented from reordering.
I would personally prefer to have these more refined tools at my disposal for when I need them, but volatile is a 3-in-1 sort of tool...
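To make point 2 concrete, here is a minimal sketch (illustrative names) of the classic stop-flag case:
class StopFlag {
    boolean plainStop;             // load may be hoisted: while (!plainStop) can become while (true)
    volatile boolean volatileStop; // loaded from memory on every iteration

    void spinOnPlain() {
        while (!plainStop) { /* may never observe another thread's write */ }
    }

    void spinOnVolatile() {
        while (!volatileStop) { /* will observe the writer's store */ }
    }
}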

What about AtomicLong.lazySet?

For those of you wondering (as I did) whether or not AtomicLong.lazySet (A.K.A Unsafe.putOrderedLong) provides atomicity, it would seem the answer is yes. Digging through the JVM source code for the putOrderedLong intrinsic yields the following nugget:
bool LibraryCallKit::inline_unsafe_ordered_store(BasicType type) {
  // This is another variant of inline_unsafe_access, differing in
  // that it always issues store-store ("release") barrier and ensures
  // store-atomicity (which only matters for "long").
  /* ... all this unpleasant sea of nodes stuff ... not what I want to talk about ... */
  insert_mem_bar(Op_MemBarRelease);
  insert_mem_bar(Op_MemBarCPUOrder);
  // Ensure that the store is atomic for longs: <--- Yay!
  const bool require_atomic_access = true;
  Node* store;
  if (type == T_OBJECT) // reference stores need a store barrier.
    store = store_oop_to_unknown(control(), base, adr, adr_type, val, type, MemNode::release);
  else {
    store = store_to_memory(control(), adr, val, type, adr_type, MemNode::release, require_atomic_access);
  }
  insert_mem_bar(Op_MemBarCPUOrder);
  return true;
}
Look at that perfectly pleasant C++ code! The store is indeed made atomic. We can further test this observation by looking at the generated assembly for a 32 vs 64 bit JVM:
;A 64 bit JVM:
mov QWORD PTR [rsi+0x118],rcx ;*invokevirtual putOrderedLong
; - org.jctools.queues.InlinedCountersSpscConcurrentArrayQueue::tailLazySet@8 (line 131)
; - org.jctools.queues.InlinedCountersSpscConcurrentArrayQueue::offer@83 (line 163)
;A 32 bit JVM:
vmovd xmm0,ecx
vmovd xmm1,ebx
vpunpckldq xmm0,xmm0,xmm1
vmovsd QWORD PTR [esi+0x110],xmm0 ;*invokevirtual putOrderedLong
; - org.jctools.queues.InlinedCountersSpscConcurrentArrayQueue::tailLazySet@8 (line 131)
; - org.jctools.queues.InlinedCountersSpscConcurrentArrayQueue::offer@83 (line 163)
There you go! Atomicity is preserved! Hoorah!
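For completeness, a small usage sketch (illustrative, not taken from the JVM sources above) of where lazySet shines: a single-writer counter that is published atomically and in order, without paying for a full volatile store:
import java.util.concurrent.atomic.AtomicLong;

class ProgressCounter {
    private final AtomicLong produced = new AtomicLong();

    void onItemProduced(long newCount) {
        // ordered store: atomic, release-ordered, visible to readers "soon",
        // but cheaper than set() because it does not require a StoreLoad barrier
        produced.lazySet(newCount);
    }

    long read() {
        return produced.get(); // plain volatile read on the consumer side
    }
}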

Monday, 28 July 2014

Poll me, maybe?

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
The java.util.Queue interface has the ever important poll/offer methods, documented thus:
/**
* Retrieves and removes the head of this queue, or returns {@code null}
* if this queue is empty.
*
* @return the head of this queue, or {@code null} if this queue is empty
*/
E poll();
/**
* Inserts the specified element into this queue if it is possible to do
* so immediately without violating capacity restrictions.
* When using a capacity-restricted queue, this method is generally
* preferable to {@link #add}, which can fail to insert an element only
* by throwing an exception.
*
* @param e the element to add
* @return {@code true} if the element was added to this queue, else
* {@code false}
*/
boolean offer(E e);
This allows the caller to assume that if the poll method returns null the queue is empty, and in a single threaded world this is quite straightforward. A problem for multi producer lock free concurrent queues, however, is that while an element may not be visible the queue may not be empty. "WOT?" I hear you cry, let's look at an example:
public E poll() {
    LinkedQueueNode<E> nextNode = consumerNode.lvNext();
    if (nextNode != null) {
        final E nextValue = nextNode.getAndNullValue();
        consumerNode = nextNode;
        return nextValue;
    }
    return null;
}

public boolean offer(final E nextValue) {
    if (nextValue == null) {
        throw new IllegalArgumentException("null elements not allowed");
    }
    final LinkedQueueNode<E> nextNode = new LinkedQueueNode<E>(nextValue);
    // xchg is like AtomicRef.getAndSet
    final LinkedQueueNode<E> prevProducerNode = xchgProducerNode(nextNode);
    // PUFF
    prevProducerNode.soNext(nextNode);
    return true;
}
The above snippet shows the offer/poll methods of an MPSC linked queue as originated by Mr. Vyukov (MPSC: Multi-Producer, Single-Consumer) and ported into Java by your humble servant (though others have ported this algo before me: Akka/RxJava/Netty, and others...).
Where be dragons?

Multiple Producers Break The Chain

How the above algorithm works for multiple producers:

  • In offer we use the same mechanism offered by AtomicReference.getAndSet (xchgProducerNode in the snippet) to atomically swap the producerNode with the new node.
  • This means no producerNode is ever returned more than once, preventing producer threads from overwriting the producerNode reference field and that node's reference to the next node.
  • We use the same mechanism offered by AtomicReference.lazySet to set this new node as the next node from the previous.
  • On the consumer thread side we process nodes by grabbing the next node and replacing the consumerNode with it after we pluck out its value.

The problem is in the offer method, between the xchgProducerNode call and the soNext call, where we first swap in the new node and only then set it to be the next node after the previous producerNode. For a short moment the old producer node has null as its next element: the chain is broken. This short moment is of undefined length; depending on scheduler pressure and the whimsical sense of humour of the gods, the producer thread which got interrupted between the two calls (at the // PUFF comment) may be held up for a while.
And yet, while this producer thread is sleeping (what dreams may come?), other producers can make progress. They may add any number of nodes to the queue, each linking the producerNode to the next. The producerNode can be at any 'distance' from the node left dangling by the suspended thread, waiting for its next field to be patched through, and there may be any number of further broken links along the way.
Looking at the poll method in this light the problem becomes more obvious. If a node may have its next set to null due to the timing described above, then poll may return null when the queue is in fact not empty.
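For reference, here is a hedged sketch of the node helpers assumed by the snippets above (the real JCTools class uses Unsafe; an AtomicReferenceFieldUpdater gives the same lv/so semantics):
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

class LinkedQueueNode<E> {
    @SuppressWarnings("rawtypes")
    private static final AtomicReferenceFieldUpdater<LinkedQueueNode, LinkedQueueNode> NEXT =
        AtomicReferenceFieldUpdater.newUpdater(LinkedQueueNode.class, LinkedQueueNode.class, "next");

    private E value;
    private volatile LinkedQueueNode<E> next;

    LinkedQueueNode(E value) { this.value = value; }

    E getAndNullValue() { E v = value; value = null; return v; }

    LinkedQueueNode<E> lvNext() { return next; }                  // lv = load volatile

    void soNext(LinkedQueueNode<E> n) { NEXT.lazySet(this, n); }  // so = store ordered (lazySet)
}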

Unbreaking The Chain

To fix this issue we could do the following:
public E poll() {
    LinkedQueueNode<E> nextNode = consumerNode.lvNext();
    if (nextNode != null) {
        final E nextValue = nextNode.getAndNullValue();
        consumerNode = nextNode;
        return nextValue;
    }
    // volatile load of producerNode
    else if (consumerNode != lvProducerNode()) {
        do {
            nextNode = consumerNode.lvNext(); // spin :(
        } while (nextNode == null); // how long will I wait?
        // nextNode is finally here…
        final E nextValue = nextNode.getAndNullValue();
        consumerNode = nextNode;
        return nextValue;
    }
    return null;
}

And indeed this is how other implementors have chosen to tackle the issue (in spirit, if not in code).
In doing so we have given up quite a lot, however:
  1. poll() is no longer wait free. Which is a shame, I quite like wait freedom.
  2. The consumer thread is volunteered into spin waiting on the next field to become visible.
  3. In the event of hitting null we now read the producerNode. This introduces the potential for a cache miss. This is a big problem to my mind as this cache miss has an unknown and potentially very large cost.
  4. The producerNode read from the consumer thread will have a negative impact on the producer threads contending to write to it. This has been previously explored here. This will be particularly bad when the consumer is spinning on the poll() method while waiting for the next value.
Is it worth it? I'm not convinced... Given that Queue already mandates an isEmpty method, could we not settle for a relaxed definition of poll? And given that, as observed above, queue emptiness is (by the queue's lock free and concurrent nature) imprecise anyway, should we really sacrifice performance/scalability for it?
At the moment I am leaning towards offering weaker guarantees for the lock-free queues offered as part of JCTools, but I'm hoping to get some feedback from prospective users on how important that part of the queue interface is to their purposes.
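To illustrate what those weaker guarantees might look like from the caller's side, here is a hedged sketch (assuming poll keeps its original wait-free form and the spinning is left to callers who actually need the stronger semantics; names are illustrative):
import java.util.Queue;

class RelaxedConsumer<E> {
    private final Queue<E> queue;

    RelaxedConsumer(Queue<E> queue) { this.queue = queue; }

    E pollOrSpinIfNotEmpty() {
        E e = queue.poll();
        if (e != null) {
            return e;
        }
        // null from a relaxed poll does not prove emptiness; only spin if the queue
        // claims it is not empty (i.e. a producer has already claimed a slot).
        while (!queue.isEmpty()) {
            e = queue.poll();
            if (e != null) {
                return e;
            }
        }
        return null; // genuinely empty, as far as we can observe
    }
}
This keeps the common path wait free for everyone and only the callers who care about the stricter poll contract pay for the spin.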

NOTE: The same issue exists for the offer method when we look at an SPMC queue, as discussed in this issue.

Tuesday, 8 July 2014

Concurrent Bugs: Size Matters

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
As part of my ongoing work on JCTools (no relation to that awfully popular Mexican dude) I've implemented SPSC/SPMC/MPSC/MPMC lock free queues, all of which conform to the java.util.Queue interface. The circular array/ring buffered variants all share a similar data structure where a consumer thread (calling the poll() method) writes to the consumerIndex and a producer thread (calling the offer() method) writes to the producerIndex. They merrily chase each other around the array as discussed in this previous post.
Recently I've had the pleasure of having Martin Thompson pore over the code and contribute some bits and pieces. Righteous brother that Mr. T is, he caught this innocent looking nugget (we're playing spot the bug, can you see it?):
/**
 * Returns the number of elements in this collection, 0 if empty.
 */
@Override
public int size() {
    // lv means load volatile, like a volatile field read.
    // Producer is at the producer index (ahead of the consumer), a long
    // Consumer is at the consumer index, also a long
    // -> size is the delta.
    return (int) (lvProducerIndex() - lvConsumerIndex());
}

What could possibly go wrong?


Like bugs? you'll looove Concurrency!

The code above can return a negative number as the size. How can that be you wonder? Let us see the same method with some slow motion comments:
/**
 * Returns the number of elements in this collection, 0 if empty.
 */
@Override
public int size() {
    // We read the producer index, it's 5 elements ahead of the consumer.
    long currProducerIndex = lvProducerIndex();
    // So now we got this number we read (let's say it's 42),
    // and some other bastard thread pushes in, we'll get the
    // CPU back in a second. Meanwhile our consumer is beavering
    // along.
    // Finally we make some progress and read the consumer index
    long currConsumerIndex = lvConsumerIndex();
    // Oh look, the consumer consumed 10 elements while we were
    // away, must be that the producer also made some progress.
    // Them is the bessssst threads I know. Let's return size:
    return (int) (currProducerIndex - currConsumerIndex);
    // 42 - 47 = -5. Damn and blast.
}

See it?
  • We need not get suspended for this to happen. The two loads are independent: the second one could hit a cache miss and get delayed, or maybe the consumer and the producer are just moving furiously fast. The important thing to realise is that because the loads are independent, even if there isn't much time between them, some time can pass by definition.
  • Because the 2 loads are volatile they cannot get re-ordered.
  • Doing it all on one line, or on 3 different lines (in the same order) makes no difference at all.

So how do we solve this?
@Override
public int size() {
    int size;
    do {
        // take consumer index first, it's the lowest of the 2 at any given time
        final long currentConsumerIndex = lvConsumerIndex();
        // take the producer index, maybe some time passed, we don't know
        final long currentProducerIndex = lvProducerIndex();
        // our best estimate of the size is still the delta. It's not a negative
        // number as both indexes are monotonically increasing.
        size = (int) (currentProducerIndex - currentConsumerIndex);
        // If observed size exceeds the allowed capacity we must know we have the
        // wrong number, try again.
    } while (size > capacity);
    // Size is between 0 and capacity, but not necessarily an exact representation.
    return size;
}

Is it really solved?
  • This will at least return a value that is in the correct range.
  • This will NOT ALWAYS return the right or correct number. The only way to get the absolutely correct number would be to somehow read both indexes atomically.
  • Atomically reading 2 disjoint longs is not possible... You can (on recent x86) atomically read 2 adjacent longs, but that is:
    • Not possible in Java
    • A recipe for false sharing
  • We could block the producers/consumers from making progress while we read the 2 values, but that would kill lock freedom.
  • This method is lock free, not wait free. In theory the while loop can spin forever in the same way that a CAS loop can spin forever. That is very unlikely however.
This is as good as it gets baby.

Lesson?

We all know that a thread can get suspended and that thread execution can interleave, but it is easy to forget/overlook that when looking at a line of code. This issue is in a way just as naive as incrementing a volatile field and expecting atomicity.
Sometimes there is no atomicity to be found and we just have to come to terms with the best approximation to a good answer we can get...
A further lesson here is that having your code reviewed by others is an amazingly effective way to find bugs, and learn :-)


Thursday, 26 June 2014

Notes on False Sharing

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
I've recently given a talk on queues and related optimisations, but due to the limitations of time, the Universe and Everything else, I had to cut the deep dive into false sharing short. Here however I'm not limited by such worldly concerns, so here is more detail for those who can stomach it.

What mean False Sharing?

Described before on this blog and in many other places (Intel, usenix, Mr. T), but assuming you are not already familiar with the issue:
"false sharing is a performance-degrading usage pattern that can arise in systems with distributed, coherent caches at the size of the smallest resource block managed by the caching mechanism. When a system participant attempts to periodically access data that will never be altered by another party, but that data shares a cache block with data that is altered, the caching protocol may force the first participant to reload the whole unit despite a lack of logical necessity. The caching system is unaware of activity within this block and forces the first participant to bear the caching system overhead required by true shared access of a resource." - From Wikipedia
Makes sense, right?
I tried my hand previously at explaining the phenomenon in terms of the underlying coherency protocol, which boils down to the same issue. These explanations often leave people still confused, and I spent some time trying to reduce the explanation so that the idea can get across in a presentation... I'm pretty sure it still left people confused.
Let's try again.
The simplest explanation I've come up with to date is:
  1. Memory is cached at a granularity of a cache line (assumed 64 bytes, which is typical, in the following examples but may be a smaller or larger power of 2, determined by the hardware you run on)
  2. Both reading and writing to a memory location from a particular CPU require the presence of a copy of the memory location in the reading/writing CPU cache. The required location is 'viewed' or cached at the granularity of a cache line.
  3. When a line is not in a thread's cache we experience a cache miss as we go searching for a valid copy of it.
  4. Once a line is in our cache it may get evicted (when more memory is required) or invalidated (because a value in that line was changed by another CPU). Otherwise it just stays there.
  5. A write to a cache line invalidates ALL cached copies of that line for ALL CPUs except the writer.
I think there is a mistaken intuition most of us have when it comes to memory access, which is that we think of it as direct. In reality however you access memory through the CPU cache hierarchy and it is in the attempt to make this system coherent and performant that the hardware undoes our intuition. Rather than thinking of all CPUs accessing a global memory pool we should have a mental model that is closer to a distributed version control system as illustrated in this excellent post on memory barriers.
Given the above setting of the stage, we can say that false sharing occurs when Thread 1 invalidates (i.e. writes to) a cache line required by Thread 2 (for reading/writing) even though Thread 2 and Thread 1 are not accessing the same location.
The cost of false sharing is therefore observable as cache misses, and the cause of false sharing is the write to the cache line (coupled with the granularity of cache coherency). Ergo, the symptom of false sharing can be observed by looking at hardware events for your process or thread:
  • If you are on linux you can use perf. (Checkout the awesome perf integration in JMH!)
  • On linux/windows/mac Intel machines there is PCM.
  • From Java you can use Overseer
  • ... look for hardware counters on google.
Now assuming you have established that the number of cache misses you experience is excessive, the challenge remains in finding the sources of these cache misses. Sadly for Java there is no free utility that I'm aware of which can easily point you to the source of your false sharing problem; C/C++ developers are spoilt for choice by comparison. Some usual suspects to consider however are any shared data constructs, for example queues.
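If you want to see the effect without any tooling, here is a crude, hedged demonstration (not a rigorous benchmark, and the numbers will vary wildly by machine): two threads bumping counters that either share a cache line or don't:
import java.util.concurrent.atomic.AtomicLongArray;

public class FalseSharingDemo {
    static final int ITERATIONS = 100_000_000;

    static void run(AtomicLongArray counters, int indexA, int indexB) throws InterruptedException {
        Thread a = new Thread(() -> { for (int i = 0; i < ITERATIONS; i++) counters.incrementAndGet(indexA); });
        Thread b = new Thread(() -> { for (int i = 0; i < ITERATIONS; i++) counters.incrementAndGet(indexB); });
        long start = System.nanoTime();
        a.start(); b.start(); a.join(); b.join();
        System.out.printf("indexes %d,%d: %dms%n", indexA, indexB, (System.nanoTime() - start) / 1_000_000);
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLongArray counters = new AtomicLongArray(64);
        run(counters, 0, 1);  // adjacent longs, same cache line: false sharing
        run(counters, 0, 16); // 128 bytes apart: independent cache lines
    }
}
On a typical multi-core machine the adjacent-slots run takes several times longer than the spread-out run, and the hardware counters will show the difference as cache misses.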


Active False Sharing: Busy Counters

Lets start with a queue with 2 fields which share the same cache line where each field is being updated by a separate thread. In this example we have an SPSC queue with the following structure:
class SpscQueue<E> {
    long producerIndex = 0;
    long consumerIndex = 0;
    E[] buffer = new ...[<capacity>];
    //... constructor and so on

    boolean offer(E e) {
        // Update producerIndex and write to the buffer at the index offset
        // Something to the effect of:
        buffer[producerIndex++ % buffer.length] = e;
    }

    E poll() {
        // Update consumerIndex and read from the buffer at the index offset
        // Something to the effect of:
        return buffer[consumerIndex++ % buffer.length];
    }
}

Ignoring the actual correct implementation and edge cases to make this a working queue (you can read more about how that part works here), we can see that offer will be called by one thread, and poll by another as this is an SPSC queue for delivering messages between 2 threads. Each thread will update a distinct field, but as the fields are all jammed tight together we know False Sharing is bound to follow.
The solution at the time of writing seems to be the introduction of padding by means of class inheritance which is covered in detail here. This will add up to an unattractive class as follows:
class PIndex {
    long producerIndex = 0;
}

class Pad extends PIndex {
    long p1, p2, p3, p4, p5, p6, p7;
}

class SpscQueue<E> extends Pad {
    long consumerIndex = 0;
    E[] buffer;
    // methods as before
}

Yipppeee! the counters are no longer false sharing! That little Pad in the middle will make them feel all fresh and secure.
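As an aside, JDK 8 added another way to get the same isolation without the inheritance gymnastics: the @sun.misc.Contended annotation (jdk.internal.vm.annotation.Contended in later JDKs), which asks the JVM to pad the annotated fields. For classes outside the JDK it only takes effect with -XX:-RestrictContended. A hedged sketch of what that could look like for the queue above:
class SpscQueueContended<E> {
    @sun.misc.Contended("producer")
    long producerIndex;

    @sun.misc.Contended("consumer")
    long consumerIndex;

    E[] buffer;
    // methods as before
}
The annotation route costs nothing in class hierarchy ugliness, but it ties you to a JVM flag and an internal API, which is why the padding-by-inheritance trick remains popular in library code.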



Passive False Sharing: Hot Neighbours

But the journey is far from over (stay strong), because in False Sharing you don't have to write to a shared cache line to suffer. In the above example both the producer and the consumer are reading the reference to buffer which is sharing a cache line with naughty Mr. consumerIndex. Every time we update consumerIndex an angel loses her wings! As if the wings business wasn't enough, the write also invalidates the cache line for the producer which is trying to read the buffer so that it can write to the damn thing. The solution?
MORE PADDING!!
class PIndex {
    long producerIndex = 0;
}

class Pad1 extends PIndex {
    long p1, p2, p3, p4, p5, p6, p7;
}

class Buffer extends Pad1 {
    E[] buffer;
}

class Pad2 extends Buffer {
    long p1, p2, p3, p4, p5, p6, p7;
}

class SpscQueue<E> extends Pad2 {
    long consumerIndex = 0;
    // methods as before
}

Surely now everyone will just get along?

The Object Next Door

So now we have the consumer and the producer feverishly updating their respective indexes and we know the indexes are not interfering with each other or with buffer, but what about the objects allocated before/after our queue object? Imagine we allocate 2 instances of the queue above and they end up next to each other in memory, we'd be practically back to where we started with the consumerIndex on the tail end of one object false sharing with the producer index at the head end of the other. In fact, the conclusion we are approaching here is that rapidly mutated fields make very poor neighbours altogether, be it to other fields in the same class or to neighbouring classes. The solution?
EVEN MORE PADDING!!! 
class Pad0 {
    long p1, p2, p3, p4, p5, p6, p7;
}

class PIndex extends Pad0 {
    long producerIndex = 0;
}

class Pad1 extends PIndex {
    long p1, p2, p3, p4, p5, p6, p7;
}

class Buffer extends Pad1 {
    E[] buffer;
}

class Pad2 extends Buffer {
    long p1, p2, p3, p4, p5, p6, p7;
}

class CIndex extends Pad2 {
    long consumerIndex = 0;
}

class SpscQueue<E> extends CIndex {
    long p1, p2, p3, p4, p5, p6, p7;
    // methods as before
}
Seems a tad extreme don't it? But hell, at least now we're safe, right?

Buffers are people too

So we have our queue padded like a newborn on its first trip out of the house in winter, but what about Miss buffer? Arrays in Java are objects, and as such the buffer field merely points to another object and is not inlined into the queue (as it can be in C/C++ for instance). This means the buffer is also subject to potentially causing or suffering from false sharing. This is particularly true for the length field on the array, which will get invalidated by writes into the first few elements of the queue, but equally true for any object allocated before or after the array. For circular array queues we know the producer will go around writing to all the elements and come back to the beginning, so the middle elements will be naturally padded. If the array is small enough and the message passing rate is high enough this can have the same effect as any hot field. Alternatively we might experience uneven behaviour for the queue as the elements around the edges of the array suffer false sharing while the middle ones don't.
Since we cannot extend the array object and pad it to our needs we can over-allocate the buffer and never use the initial/last slots:
... As Before ...
class Buffer extends Pad1 {
    // Allocate extra 16 elements either side
    E[] buffer = new ...[16 + <capacity> + 16];
}
... As Before ...
class SpscQueue<E> extends CIndex {
    boolean offer(E e) {
        // Skip first 16 and only use the inner <capacity> elements
        buffer[16 + (producerIndex++ % (buffer.length - 32))] = e;
    }

    E poll() {
        // Skip first 16 and only use the inner <capacity> elements
        return buffer[16 + (consumerIndex++ % (buffer.length - 32))];
    }
}
We cannot prevent false sharing between neighbours and the array length, but what we can do is hoist the length into our own class field and use our copy instead of reading it off the array. This can also result in better locality if the hoisted length value is stored in a field alongside the array reference itself.
Note that this is costing us some computational overhead to compute the offset index instead of the natural one, but in practice we can implement queues such that we achieve the same effect with no overhead at all.
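A sketch of the length-hoisting idea (illustrative names, assuming a power of 2 capacity so the modulo becomes a mask and the 'natural' offset costs nothing extra):
class SpscQueueHoistedLength<E> {
    final E[] buffer;
    final long mask; // capacity - 1, for a power of 2 capacity; replaces buffer.length in hot code

    @SuppressWarnings("unchecked")
    SpscQueueHoistedLength(int capacity) {
        buffer = (E[]) new Object[capacity];
        mask = capacity - 1;
    }

    int offsetOf(long index) {
        return (int) (index & mask); // no read of buffer.length on the hot path
    }
}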

Protection against the Elements

While on this topic (I do go on about it... hmmm) we can observe that the producer and the consumer threads can still experience false sharing on the cache line holding adjacent elements in the buffer array. This is something we can perhaps handle more effectively if we had arrays of structs in Java (see the value types JEP and structured arrays proposal), and if the queue was to be a queue of such structs. Reality being the terrible place that it is, this is not the case yet. If we could prove our application indeed suffered from this issue, could we solve it?
Yes, but at a questionable price...
... As before ...
class Buffer extends Pad1 {
    // Allocate ELEMENT_SIZE elements per element
    E[] buffer = new ...[16 + (<capacity> * ELEMENT_SIZE) + 16];
}
... As before ...
class SpscQueue<E> extends CIndex {
    boolean offer(E e) {
        // Use up ELEMENT_SIZE elements per element so there are
        // ELEMENT_SIZE-1 empty elements between adjacent items
        buffer[16 + (producerIndex++ % (buffer.length - 32)) * ELEMENT_SIZE] = e;
    }

    E poll() {
        // Use up ELEMENT_SIZE elements per element so there are
        // ELEMENT_SIZE-1 empty elements between adjacent items
        return buffer[16 + (consumerIndex++ % (buffer.length - 32)) * ELEMENT_SIZE];
    }
}

If we allocate extra elements for each reference we mean to store in the queue we can use the empty elements to pad the used elements. This will reduce the density of each cache line and as a result the probability of false sharing, as fewer and fewer elements share a cache line. This comes at a high cost however, as we multiply the size of the buffer to make room for the empty slots and actively sabotage the memory throughput, since we now get less useful data per cache line read. This is an optimization I am reluctant to straight out recommend as a result, but what the hell? Sometimes it might help.

Playing with Dirty Cards

This last one is a nugget and a half. Usually when people say: "It's a JVM/compiler/OS issue" it turns out that it's a code issue written by those same people. But sometimes it is indeed the machine.
In certain cases you might not be the cause of False Sharing. In some cases you might be experiencing False Sharing induced by card marking. I won't spoil it for you, follow the link.

Summary

So there you have it, 7 different cases of False Sharing encountered in the process of optimizing/proofing the piddly SPSC queue. Each one had an impact, some more easily measured than others. The take away here is not "Pad every object and field to death" as that will be detrimental in most cases, much like making every field volatile. But this is an issue worth keeping in mind when writing shared data structures, and in particular when considering highly contended ones.
We've discussed 2 types of False Sharing:

  • Active: where both threads update distinct locations on the same cache line. This will severely impact both threads as they both experience cache misses and cause repeated invalidation of the cache line.
  • Passive: where one thread writes to and another reads from distinct locations on the same cache line. This will have a major impact on the reading thread with many reads resulting in a cache miss. This will also have an impact on the writer as the cache line sharing overhead is experienced.
These were identified and mitigated in several forms:

  1. Neighbouring hot/hot fields (Active)
  2. Neighbouring hot/cold fields (Passive)
  3. Neighbouring objects (potential for both Active and Passive)
  4. Objects neighbouring an array and the array elements (same as 3, but worth highlighting as a subtlety that arrays are separate objects)
  5. Array elements and array length field (Passive as the length is never mutated)
  6. Distinct elements sharing the same cache line (Passive in above example, but normally Active as consumer nulls out the reference on poll)
  7. -XX:+UseCondCardMark - see Dave Dice article

What should you do?
  1. Consider fields concurrently accessed by different threads, in particular frequently written to fields and their neighbours.
  2. Consider Active AND Passive false sharing.
  3. Consider (and be considerate to) your neighbours.
  4. All padding comes at the cost of using up valuable cache space for isolation. If over used (like in the elements padding) performance can suffer.
And finally: love thy neighbour, just a little bit and with adequate protection and appropriate padding, but still love'em.
Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium!