Psychosomatic, Lobotomy, Saw: Disassembling a JMH Nano-Benchmark

{UPDATE 03/09/14: If you come here looking for JMH related content start at the new and improved JMH Resources Page and branch out from there!}
I often feel it is nano-benchmarks that give microbenchmarks a bad name (that and the fact MBMs tend to sell crack and their young bodies). Putting to one side the latter issue for bleeding heart liberalists to solve, we are left with the former. In this post I'd like to help the budding nano-benchmark writer resolve and investigate the embarrassing dilemma of: "What just happened?"
"What just happened?" is a question you should almost always ask yourself when running a nano-benchmark. The chances of the compiler finding out your benchmark does nothing, or that significant part of your benchmark can be omitted, are surprisingly large. This is partly a case of extreme cleverness of compiler writers and partly the simplicity of the benchmark code potentially leaving the door open to optimisations perhaps not possible in the wild. The best way to answer the question is to have a look at the assembly end result of your benchmark code.
Hipster developer that I am, I use JMH to write microbenchmarks. Chances are you should too if you are writing nano/micro benchmarks as it goes a long way toward solving common issues. In the rest of this post we'll be looking at the assembly produced by JMH benchmarks and explaining away the framework so that you can more easily find your way in your own benchmark.

The NOOP benchmark

I started with the observation that nano-benchmarks sometime get optimized away, if they did they'd have the same end result as this benchmark:

	@BenchmarkMode(Mode.AverageTime)
	@OutputTimeUnit(TimeUnit.NANOSECONDS)
	@State(Scope.Thread)
	public class BaselineBenchmarks {
	@Benchmark
	public void noop() {
	}
	}

view raw gistfile1.java hosted with ❤ by GitHub

Exciting stuff! So we measure nothing at all. How are we measuring this? JMH generates some code around a call to the above method that will do the measurement:

	public void noop_avgt_jmhLoop(InfraControl control, RawResults result,
	BaselineBenchmarks_1_jmh benchmark,
	Blackhole_1_jmh l_blackhole1_1) throws Throwable {
	long operations = 0;
	long realTime = 0;
	result.startTime = System.nanoTime();
	do {
	benchmark.noop(); // <-- the original noop()
	operations++;
	} while(!control.isDone);
	result.stopTime = System.nanoTime();
	result.realTime = realTime;
	result.operations = operations;
	}

view raw gistfile1.java hosted with ❤ by GitHub

So we have a while loop, spinning on the isDone flag and counting how many times we can manage to execute it until someone tells us to stop (by setting the isDone flag to true). It follows therefore that the measurement overhead is:

Reading the volatile field isDone (L1 hitting read, predictable)
Incrementing a counter (on the stack)

But healthy skepticism is what this is all about, let's see what the generated assembly looks like! I'll be gentle, assembly is often hard on the eyes.

Getting The Assembly Output

To try this at home you'll need a drink, a JVM setup to print assembley and the sample code. Build the project with maven and you run the benchmark and generate the assembly using the following command:

$JAVA_HOME/bin/java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*.noop_avgt_jmhLoop -XX:PrintAssemblyOptions=intel -XX:-UseCompressedOops -jar target/microbenchmarks.jar -i 5 -wi 5 -f 0 ".*.noop" > noop.ass

I'm only printing the measurement method, using the Intel syntax instead of the default AT&T and disabling compressed oops to get simpler output for this particular exercise. The output will contain several versions of the compiled method, I will be discussing the final version which is the last in the output.
Now we got the assembly printed we can get familiar with the structure of the JMH measurement loop as it is translated into assembly:

	Decoding compiled method 0x00007fe5a106bb90:
	Code:
	[Entry Point]
	[Constants]
	# {method} 'noop_avgt_jmhLoop' '(Lorg/openjdk/jmh/runner/InfraControl;Lorg/openjdk/jmh/results/RawResults;Lpsy/lob/saw/generated/BaselineBenchmarks_noop$BaselineBenchmarks_1_jmh;Lpsy/lob/saw/generate
	d/BaselineBenchmarks_noop$Blackhole_1_jmh;)V' in 'psy/lob/saw/generated/BaselineBenchmarks_noop'
	# this: rsi:rsi = 'psy/lob/saw/generated/BaselineBenchmarks_noop'
	# parm0: rdx:rdx = 'org/openjdk/jmh/runner/InfraControl'
	# parm1: rcx:rcx = 'org/openjdk/jmh/results/RawResults'
	# parm2: r8:r8 = 'psy/lob/saw/generated/BaselineBenchmarks_noop$BaselineBenchmarks_1_jmh'
	# parm3: r9:r9 = 'psy/lob/saw/generated/BaselineBenchmarks_noop$Blackhole_1_jmh'
	# [sp+0x20] (sp of caller)
	0x00007fe5a106bce0: cmp rax,QWORD PTR [rsi+0x8]
	0x00007fe5a106bce4: jne 0x00007fe5a1037960 ; {runtime_call}
	0x00007fe5a106bcea: xchg ax,ax ; [NW] NOP
	0x00007fe5a106bcec: nop DWORD PTR [rax+0x0] ; [NW] NOP, Align the function body
	[Verified Entry Point]
	0x00007fe5a106bcf0: mov DWORD PTR [rsp-0x14000],eax
	0x00007fe5a106bcf7: push rbp
	0x00007fe5a106bcf8: sub rsp,0x10 ;*synchronization entry
	; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@-1 (line 156)
	0x00007fe5a106bcfc: mov r13,rdx ; [NW] control ref is now in R13
	0x00007fe5a106bcff: mov rbp,r8 ; [NW] benchmark ref is now in RBP
	0x00007fe5a106bd02: mov rbx,rcx ; [NW] result is now in RBX

view raw gistfile1.asm hosted with ❤ by GitHub

This is just the preliminaries for the method, so not much to see except noting which reference is in which register to help interpret the rest of the code. The comments in the printout are generated by the JVM, my comments are prefixed with [NW].
Once all the pieces are in place we can move on to some actual work.

Measurement Loop: 2 Timestamps diverged in a yellow wood

Refresh you memory of what the java code above does and let's see if we can see it here:

	0x00007fe5a106bd05: mov r10,0x7fe5abd38f20 ; [NW] Setup the System.nanoTime() address for call
	0x00007fe5a106bd0f: call r10 ;*invokestatic nanoTime
	; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@7 (line 158)
	; [NW] RBX is result, RBX+0x20 is result.startTime
	0x00007fe5a106bd12: mov QWORD PTR [rbx+0x20],rax ;*putfield startTime
	; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@10 (line 158)
	; implicit exception: dispatches to 0x00007fe5a106bd81
	0x00007fe5a106bd16: mov r11,rbp ; [NW] R11 = RBP which is l_baselinebenchmarks0_0

	; [NW] the following 2 lines mean: "if (l_baselinebenchmarks0_0 == null) throw new NullPointerException();"
	0x00007fe5a106bd19: test r11,r11 ; [NW] R11 & R11
	0x00007fe5a106bd1c: je 0x00007fe5a106bd70 ;*invokevirtual noop
	; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@14 (line 160)
	; [NW] R13 is control
	0x00007fe5a106bd1e: movzx r10d,BYTE PTR [r13+0x9c] ;*getfield isDone
	; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@24 (line 162)
	; implicit exception: dispatches to 0x00007fe5a106bd95
	; [NW] EBP is the lower half of RBP, set it to 1
	0x00007fe5a106bd26: mov ebp,0x1

	; [NW] following 2 lines mean: "if (isDone == true) goto FINISH;"
	0x00007fe5a106bd2b: test r10d,r10d
	0x00007fe5a106bd2e: jne 0x00007fe5a106bd47 ;*aload_3
	; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@13 (line 160)
	; [NW] LOOP START:
	0x00007fe5a106bd30: movzx r10d,BYTE PTR [r13+0x9c] ;*getfield isDone
	; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@24 (line 162)
	; [NW] RBP is operations, so do "operations++;"
	0x00007fe5a106bd38: add rbp,0x1 ; OopMap{r11=Oop rbx=Oop r13=Oop off=92}
	;*ifeq
	; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@27 (line 162)
	; [NW] Safepoint!
	0x00007fe5a106bd3c: test DWORD PTR [rip+0xb5482be],eax # 0x00007fe5ac5b4000
	; {poll}
	; [NW] following 2 lines mean: "if (isDone != true) goto LOOP START;"
	0x00007fe5a106bd42: test r10d,r10d
	0x00007fe5a106bd45: je 0x00007fe5a106bd30 ;*aload_2
	; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@30 (line 163)
	; [NW] FINISH : Finished measuring
	0x00007fe5a106bd47: mov r10,0x7fe5abd38f20 ; [NW] Setup the System.nanoTime() address for call
	0x00007fe5a106bd51: call r10 ;*invokestatic nanoTime
	; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@31 (line 163)
	; [NW] Populate the result fields
	0x00007fe5a106bd54: mov QWORD PTR [rbx+0x10],rbp ;*putfield operations
	; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@46 (line 165)
	0x00007fe5a106bd58: mov QWORD PTR [rbx+0x28],rax ;*putfield stopTime
	; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@34 (line 163)

	; [NW] Note that realTime variable is gone, we just set the field to 0
	0x00007fe5a106bd5c: mov QWORD PTR [rbx+0x18],0x0 ;*putfield realTime
	; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@40 (line 164)

	; [NW] method wrap up mechanics, including another safepoint before returning
	0x00007fe5a106bd64: add rsp,0x10
	0x00007fe5a106bd68: pop rbp
	0x00007fe5a106bd69: test DWORD PTR [rip+0xb548291],eax # 0x00007fe5ac5b4000
	; {poll_return}
	0x00007fe5a106bd6f: ret

view raw gistfile1.asm hosted with ❤ by GitHub

Have a sip and scan slowly. Here's some nuggets to consider:

As expected the noop() method is not called and any mention of it is gone from the measurement loop.
The first iteration of the loop has been 'peeled', this is common practice.
Even though we never call noop(), we still have to do the null check for the benchmark reference.
The sharp of eye reader will have noticed the redundant realTime variable in the generated measurement loop, so has the JIT compiler and it has been replaced with setting the result.realTime field directly to 0.
RBP is an 8 byte register, EBP is the lower half of the same register. Setting EBP to 1 in the peeled first iteration is the same as setting RBP to 1.
The measurement loop includes a safepoint! put that down as further measurement overhead.

This is the simplest benchmark one can write with JMH. On my test machine (an Intel Xeon E5-2697 v2 @ 2.70GHz) doing nothing is quite fast at 0.288 ns/op.
As you may have expected, reading the generated assembly is not so pleasant, I find the generated comments are very helpful for orientation and the timestamp calls on either side of the measurement loop help in zooming in on the important bits.

A Nano-Benchmark: i++

Nothing says "nano-benchmark" like benchmarking a single operation. Let's have a go at it!

	int i;
	@Benchmark
	public void increment() {
	i++;
	}

view raw gistfile1.java hosted with ❤ by GitHub

The generated loop is the same, but this time that crafty old JIT compiler cannot just do nothing with our code. We will finally learn the true cost of incrementing an integer! Given the overhead includes a long increment already I might even guess the cost at 0.25 ns/op, so maybe the result reported by JMH will be 0.5 ns/op? A warm fuzzy feeling of wisdom.

But when I run this benchmark on the same machine I learn to my dismay that incrementing an integer takes 1.794 ns/op according to my JMH benchmark. Damn integers! why does the JVM torture us so with slow integer increments?

This is a silly benchmark, and the result makes absolutely no sense as an estimate of the cost of the ++ operator on integers. So what does it mean? Could it be that the JIT compiler failed us? Lets have a look at the assembly:

	; [NW] START: R8 is benchmark, [r8+0x10] is benchmark.i
	0x00007f1d35068640: inc DWORD PTR [r8+0x10] ;*invokevirtual increment
	; - psy.lob.saw.generated.BaselineBenchmarks_increment::increment_avgt_jmhLoop@14 (line 160)
	; [NW] R13 is control
	0x00007f1d35068644: movzx r10d,BYTE PTR [r13+0x9c] ;*getfield isDone
	; - psy.lob.saw.generated.BaselineBenchmarks_increment::increment_avgt_jmhLoop@24 (line 162)
	; [NW] operations++
	0x00007f1d3506864c: add rbp,0x1 ; OopMap{r8=Oop rbx=Oop r13=Oop off=112}
	;*ifeq
	; - psy.lob.saw.generated.BaselineBenchmarks_increment::increment_avgt_jmhLoop@27 (line 162)
	; [NW] Safepoint
	0x00007f1d35068650: test DWORD PTR [rip+0xb5119aa],eax # 0x00007f1d4057a000
	; {poll}

	; [NW] if(!isDone) goto START
	0x00007f1d35068656: test r10d,r10d
	0x00007f1d35068659: je 0x00007f1d35068640 ;*aload_2
	; - psy.lob.saw.generated.BaselineBenchmarks_increment::increment_avgt_jmhLoop@30 (line 163)

view raw gistfile1.asm hosted with ❤ by GitHub

So why is the reported cost so much higher than our expectation?

What just happened?

My increment method got translated perfectly into: "inc DWORD PTR [r8+0x10]". There is no compiler issue. The comparison I made between incrementing the operations counter and incrementing the benchmark field is flawed/misguided/stupid/ignorant when taking into account the benchmark framework.

The context in which we increment operations is:

It's a long variable allocated on the stack
It's used in a very small methods where there is no register pressure
It follows that operations is always a register
ADD/INC on a register cost very little (it's the cheapest thing you can do usually)

The context in which we increment benchmark.i is:

It's a field on the benchmark object
It's subject to happens-before rules so cannot be hoisted into a register inside the measurement loop (because control.isDone is a volatile read, see this post for more detail)
It follows that benchmark.i is always a memory location
INC on a memory location is not so cheap (by nano benchmark standards)

Consulting with the most excellent Agner Fog instructions tables tells me that for Ivy Bridge the latency for INC on memory is 6 cycles, while the latency on ADD for a register is 1. This indeed agrees to some extent with the cost reported by JMH (assuming 0.288 was for one cycle, 0.288 * 6 = 1.728 which is pretty close to 1.794). But that's bad analysis. The truth is that cost is not additive, particularly when nano-benchmarks are concerned. In this case the cost of the INC seems to swallow up the baseline cost we measured before.

Is there something wrong with JMH? I don't think so. If we take the benchmark to be "an attempt at estimating the cost of calling a method which increments a field" then I would argue we got a valid answer. It's not the only answer however. Calling the same method in a context which allows further optimizations would yield a different answer.

4 comments:

Anonymous4 May 2016, 20:18:00
Thanks for this post. I was going through an assembly listing and came across this instruction-
0x00000000027f1879: vaddsd xmm0,xmm0,QWORD PTR [rip+0xffffffffffffff9f] # 0x00000000027f1820

I am not able to figure out what "rip+0xffffffffffffff9f" signifies and what the "# 0x00000000027f1820" comment is for. Any tips? I can see a reference to this in the constants section-

[Constants]
0x00000000027f1820 (offset: 0): 0x001f0ac0 0x402c48c6001f0ac0
0x00000000027f1824 (offset: 4): 0x402c48c6

Is there any documentation on how to interpret this section?
Anonymous16 Aug 2017, 11:21:00
Great article, thank you! Turning compressed oops off makes it easier to understand what's going on.
One remark: listing 3 line 22 says: "control ref is now in RDX", shouldn't it say "control ref is now in R13"?

Note: only a member of this blog may post a comment.

Wednesday, 27 August 2014

Disassembling a JMH Nano-Benchmark

The NOOP benchmark

Getting The Assembly Output

Measurement Loop: 2 Timestamps diverged in a yellow wood

A Nano-Benchmark: i++

What just happened?

4 comments: