I often feel it is nano-benchmarks that give microbenchmarks a bad name (that and the fact MBMs tend to sell crack and their young bodies). Putting to one side the latter issue for bleeding heart liberalists to solve, we are left with the former. In this post I'd like to help the budding nano-benchmark writer resolve and investigate the embarrassing dilemma of: "What just happened?"
"What just happened?" is a question you should almost always ask yourself when running a nano-benchmark. The chances of the compiler finding out your benchmark does nothing, or that significant part of your benchmark can be omitted, are surprisingly large. This is partly a case of extreme cleverness of compiler writers and partly the simplicity of the benchmark code potentially leaving the door open to optimisations perhaps not possible in the wild. The best way to answer the question is to have a look at the assembly end result of your benchmark code.
Hipster developer that I am, I use JMH to write microbenchmarks. Chances are you should too if you are writing nano/micro benchmarks as it goes a long way toward solving common issues. In the rest of this post we'll be looking at the assembly produced by JMH benchmarks and explaining away the framework so that you can more easily find your way in your own benchmark.
The NOOP benchmark
I started with the observation that nano-benchmarks sometime get optimized away, if they did they'd have the same end result as this benchmark:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
@BenchmarkMode(Mode.AverageTime) | |
@OutputTimeUnit(TimeUnit.NANOSECONDS) | |
@State(Scope.Thread) | |
public class BaselineBenchmarks { | |
@Benchmark | |
public void noop() { | |
} | |
} |
Exciting stuff! So we measure nothing at all. How are we measuring this? JMH generates some code around a call to the above method that will do the measurement:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
public void noop_avgt_jmhLoop(InfraControl control, RawResults result, | |
BaselineBenchmarks_1_jmh benchmark, | |
Blackhole_1_jmh l_blackhole1_1) throws Throwable { | |
long operations = 0; | |
long realTime = 0; | |
result.startTime = System.nanoTime(); | |
do { | |
benchmark.noop(); // <-- the original noop() | |
operations++; | |
} while(!control.isDone); | |
result.stopTime = System.nanoTime(); | |
result.realTime = realTime; | |
result.operations = operations; | |
} |
So we have a while loop, spinning on the isDone flag and counting how many times we can manage to execute it until someone tells us to stop (by setting the isDone flag to true). It follows therefore that the measurement overhead is:
- Reading the volatile field isDone (L1 hitting read, predictable)
- Incrementing a counter (on the stack)
Getting The Assembly Output
To try this at home you'll need a drink, a JVM setup to print assembley and the sample code. Build the project with maven and you run the benchmark and generate the assembly using the following command:
$JAVA_HOME/bin/java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*.noop_avgt_jmhLoop -XX:PrintAssemblyOptions=intel -XX:-UseCompressedOops -jar target/microbenchmarks.jar -i 5 -wi 5 -f 0 ".*.noop" > noop.assI'm only printing the measurement method, using the Intel syntax instead of the default AT&T and disabling compressed oops to get simpler output for this particular exercise. The output will contain several versions of the compiled method, I will be discussing the final version which is the last in the output.
Now we got the assembly printed we can get familiar with the structure of the JMH measurement loop as it is translated into assembly:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Decoding compiled method 0x00007fe5a106bb90: | |
Code: | |
[Entry Point] | |
[Constants] | |
# {method} 'noop_avgt_jmhLoop' '(Lorg/openjdk/jmh/runner/InfraControl;Lorg/openjdk/jmh/results/RawResults;Lpsy/lob/saw/generated/BaselineBenchmarks_noop$BaselineBenchmarks_1_jmh;Lpsy/lob/saw/generate | |
d/BaselineBenchmarks_noop$Blackhole_1_jmh;)V' in 'psy/lob/saw/generated/BaselineBenchmarks_noop' | |
# this: rsi:rsi = 'psy/lob/saw/generated/BaselineBenchmarks_noop' | |
# parm0: rdx:rdx = 'org/openjdk/jmh/runner/InfraControl' | |
# parm1: rcx:rcx = 'org/openjdk/jmh/results/RawResults' | |
# parm2: r8:r8 = 'psy/lob/saw/generated/BaselineBenchmarks_noop$BaselineBenchmarks_1_jmh' | |
# parm3: r9:r9 = 'psy/lob/saw/generated/BaselineBenchmarks_noop$Blackhole_1_jmh' | |
# [sp+0x20] (sp of caller) | |
0x00007fe5a106bce0: cmp rax,QWORD PTR [rsi+0x8] | |
0x00007fe5a106bce4: jne 0x00007fe5a1037960 ; {runtime_call} | |
0x00007fe5a106bcea: xchg ax,ax ; [NW] NOP | |
0x00007fe5a106bcec: nop DWORD PTR [rax+0x0] ; [NW] NOP, Align the function body | |
[Verified Entry Point] | |
0x00007fe5a106bcf0: mov DWORD PTR [rsp-0x14000],eax | |
0x00007fe5a106bcf7: push rbp | |
0x00007fe5a106bcf8: sub rsp,0x10 ;*synchronization entry | |
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@-1 (line 156) | |
0x00007fe5a106bcfc: mov r13,rdx ; [NW] control ref is now in R13 | |
0x00007fe5a106bcff: mov rbp,r8 ; [NW] benchmark ref is now in RBP | |
0x00007fe5a106bd02: mov rbx,rcx ; [NW] result is now in RBX |
This is just the preliminaries for the method, so not much to see except noting which reference is in which register to help interpret the rest of the code. The comments in the printout are generated by the JVM, my comments are prefixed with [NW].
Once all the pieces are in place we can move on to some actual work.
Measurement Loop: 2 Timestamps diverged in a yellow wood
Refresh you memory of what the java code above does and let's see if we can see it here:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
0x00007fe5a106bd05: mov r10,0x7fe5abd38f20 ; [NW] Setup the System.nanoTime() address for call | |
0x00007fe5a106bd0f: call r10 ;*invokestatic nanoTime | |
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@7 (line 158) | |
; [NW] RBX is result, RBX+0x20 is result.startTime | |
0x00007fe5a106bd12: mov QWORD PTR [rbx+0x20],rax ;*putfield startTime | |
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@10 (line 158) | |
; implicit exception: dispatches to 0x00007fe5a106bd81 | |
0x00007fe5a106bd16: mov r11,rbp ; [NW] R11 = RBP which is l_baselinebenchmarks0_0 | |
; [NW] the following 2 lines mean: "if (l_baselinebenchmarks0_0 == null) throw new NullPointerException();" | |
0x00007fe5a106bd19: test r11,r11 ; [NW] R11 & R11 | |
0x00007fe5a106bd1c: je 0x00007fe5a106bd70 ;*invokevirtual noop | |
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@14 (line 160) | |
; [NW] R13 is control | |
0x00007fe5a106bd1e: movzx r10d,BYTE PTR [r13+0x9c] ;*getfield isDone | |
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@24 (line 162) | |
; implicit exception: dispatches to 0x00007fe5a106bd95 | |
; [NW] EBP is the lower half of RBP, set it to 1 | |
0x00007fe5a106bd26: mov ebp,0x1 | |
; [NW] following 2 lines mean: "if (isDone == true) goto FINISH;" | |
0x00007fe5a106bd2b: test r10d,r10d | |
0x00007fe5a106bd2e: jne 0x00007fe5a106bd47 ;*aload_3 | |
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@13 (line 160) | |
; [NW] LOOP START: | |
0x00007fe5a106bd30: movzx r10d,BYTE PTR [r13+0x9c] ;*getfield isDone | |
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@24 (line 162) | |
; [NW] RBP is operations, so do "operations++;" | |
0x00007fe5a106bd38: add rbp,0x1 ; OopMap{r11=Oop rbx=Oop r13=Oop off=92} | |
;*ifeq | |
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@27 (line 162) | |
; [NW] Safepoint! | |
0x00007fe5a106bd3c: test DWORD PTR [rip+0xb5482be],eax # 0x00007fe5ac5b4000 | |
; {poll} | |
; [NW] following 2 lines mean: "if (isDone != true) goto LOOP START;" | |
0x00007fe5a106bd42: test r10d,r10d | |
0x00007fe5a106bd45: je 0x00007fe5a106bd30 ;*aload_2 | |
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@30 (line 163) | |
; [NW] FINISH : Finished measuring | |
0x00007fe5a106bd47: mov r10,0x7fe5abd38f20 ; [NW] Setup the System.nanoTime() address for call | |
0x00007fe5a106bd51: call r10 ;*invokestatic nanoTime | |
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@31 (line 163) | |
; [NW] Populate the result fields | |
0x00007fe5a106bd54: mov QWORD PTR [rbx+0x10],rbp ;*putfield operations | |
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@46 (line 165) | |
0x00007fe5a106bd58: mov QWORD PTR [rbx+0x28],rax ;*putfield stopTime | |
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@34 (line 163) | |
; [NW] Note that realTime variable is gone, we just set the field to 0 | |
0x00007fe5a106bd5c: mov QWORD PTR [rbx+0x18],0x0 ;*putfield realTime | |
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@40 (line 164) | |
; [NW] method wrap up mechanics, including another safepoint before returning | |
0x00007fe5a106bd64: add rsp,0x10 | |
0x00007fe5a106bd68: pop rbp | |
0x00007fe5a106bd69: test DWORD PTR [rip+0xb548291],eax # 0x00007fe5ac5b4000 | |
; {poll_return} | |
0x00007fe5a106bd6f: ret |
- As expected the noop() method is not called and any mention of it is gone from the measurement loop.
- The first iteration of the loop has been 'peeled', this is common practice.
- Even though we never call noop(), we still have to do the null check for the benchmark reference.
- The sharp of eye reader will have noticed the redundant realTime variable in the generated measurement loop, so has the JIT compiler and it has been replaced with setting the result.realTime field directly to 0.
- RBP is an 8 byte register, EBP is the lower half of the same register. Setting EBP to 1 in the peeled first iteration is the same as setting RBP to 1.
- The measurement loop includes a safepoint! put that down as further measurement overhead.
As you may have expected, reading the generated assembly is not so pleasant, I find the generated comments are very helpful for orientation and the timestamp calls on either side of the measurement loop help in zooming in on the important bits.
A Nano-Benchmark: i++
Nothing says "nano-benchmark" like benchmarking a single operation. Let's have a go at it!
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
int i; | |
@Benchmark | |
public void increment() { | |
i++; | |
} |
The generated loop is the same, but this time that crafty old JIT compiler cannot just do nothing with our code. We will finally learn the true cost of incrementing an integer! Given the overhead includes a long increment already I might even guess the cost at 0.25 ns/op, so maybe the result reported by JMH will be 0.5 ns/op? A warm fuzzy feeling of wisdom.
But when I run this benchmark on the same machine I learn to my dismay that incrementing an integer takes 1.794 ns/op according to my JMH benchmark. Damn integers! why does the JVM torture us so with slow integer increments?
This is a silly benchmark, and the result makes absolutely no sense as an estimate of the cost of the ++ operator on integers. So what does it mean? Could it be that the JIT compiler failed us? Lets have a look at the assembly:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
; [NW] START: R8 is benchmark, [r8+0x10] is benchmark.i | |
0x00007f1d35068640: inc DWORD PTR [r8+0x10] ;*invokevirtual increment | |
; - psy.lob.saw.generated.BaselineBenchmarks_increment::increment_avgt_jmhLoop@14 (line 160) | |
; [NW] R13 is control | |
0x00007f1d35068644: movzx r10d,BYTE PTR [r13+0x9c] ;*getfield isDone | |
; - psy.lob.saw.generated.BaselineBenchmarks_increment::increment_avgt_jmhLoop@24 (line 162) | |
; [NW] operations++ | |
0x00007f1d3506864c: add rbp,0x1 ; OopMap{r8=Oop rbx=Oop r13=Oop off=112} | |
;*ifeq | |
; - psy.lob.saw.generated.BaselineBenchmarks_increment::increment_avgt_jmhLoop@27 (line 162) | |
; [NW] Safepoint | |
0x00007f1d35068650: test DWORD PTR [rip+0xb5119aa],eax # 0x00007f1d4057a000 | |
; {poll} | |
; [NW] if(!isDone) goto START | |
0x00007f1d35068656: test r10d,r10d | |
0x00007f1d35068659: je 0x00007f1d35068640 ;*aload_2 | |
; - psy.lob.saw.generated.BaselineBenchmarks_increment::increment_avgt_jmhLoop@30 (line 163) |
What just happened?
My increment method got translated perfectly into: "inc DWORD PTR [r8+0x10]". There is no compiler issue. The comparison I made between incrementing the operations counter and incrementing the benchmark field is flawed/misguided/stupid/ignorant when taking into account the benchmark framework.
The context in which we increment operations is:
- It's a long variable allocated on the stack
- It's used in a very small methods where there is no register pressure
- It follows that operations is always a register
- ADD/INC on a register cost very little (it's the cheapest thing you can do usually)
- It's a field on the benchmark object
- It's subject to happens-before rules so cannot be hoisted into a register inside the measurement loop (because control.isDone is a volatile read, see this post for more detail)
- It follows that benchmark.i is always a memory location
- INC on a memory location is not so cheap (by nano benchmark standards)
Is there something wrong with JMH? I don't think so. If we take the benchmark to be "an attempt at estimating the cost of calling a method which increments a field" then I would argue we got a valid answer. It's not the only answer however. Calling the same method in a context which allows further optimizations would yield a different answer.
Thanks for this post. I was going through an assembly listing and came across this instruction-
ReplyDelete0x00000000027f1879: vaddsd xmm0,xmm0,QWORD PTR [rip+0xffffffffffffff9f] # 0x00000000027f1820
I am not able to figure out what "rip+0xffffffffffffff9f" signifies and what the "# 0x00000000027f1820" comment is for. Any tips? I can see a reference to this in the constants section-
[Constants]
0x00000000027f1820 (offset: 0): 0x001f0ac0 0x402c48c6001f0ac0
0x00000000027f1824 (offset: 4): 0x402c48c6
Is there any documentation on how to interpret this section?
Hi,
DeleteNo offence, but your comment is not really on topic for the post.
I can't make sense of your assembly without further context.
I think you should post this on SO, the comments section is not the place.
Sorry.
Great article, thank you! Turning compressed oops off makes it easier to understand what's going on.
ReplyDeleteOne remark: listing 3 line 22 says: "control ref is now in RDX", shouldn't it say "control ref is now in R13"?
Well done, and many thanks :-), fixed now
Delete