Wednesday, 27 August 2014

Disassembling a JMH Nano-Benchmark

{UPDATE 03/09/14: If you come here looking for JMH related content start at the new and improved JMH Resources Page and branch out from there!}
I often feel it is nano-benchmarks that give microbenchmarks a bad name (that and the fact MBMs tend to sell crack and their young bodies). Putting to one side the latter issue for bleeding heart liberalists to solve, we are left with the former. In this post I'd like to help the budding nano-benchmark writer resolve and investigate the embarrassing dilemma of: "What just happened?"
"What just happened?" is a question you should almost always ask yourself when running a nano-benchmark. The chances of the compiler finding out your benchmark does nothing, or that significant part of your benchmark can be omitted, are surprisingly large. This is partly a case of extreme cleverness of compiler writers and partly the simplicity of the benchmark code potentially leaving the door open to optimisations perhaps not possible in the wild. The best way to answer the question is to have a look at the assembly end result of your benchmark code.
Hipster developer that I am, I use JMH to write microbenchmarks. Chances are you should too if you are writing nano/micro benchmarks as it goes a long way toward solving common issues. In the rest of this post we'll be looking at the assembly produced by JMH benchmarks and explaining away the framework so that you can more easily find your way in your own benchmark.

The NOOP benchmark

I started with the observation that nano-benchmarks sometimes get optimized away; if that happened, they'd have the same end result as this benchmark:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class BaselineBenchmarks {
    @Benchmark
    public void noop() {
    }
}
Exciting stuff! So we measure nothing at all. How are we measuring this? JMH generates some code around a call to the above method that will do the measurement:
public void noop_avgt_jmhLoop(InfraControl control, RawResults result,
                              BaselineBenchmarks_1_jmh benchmark,
                              Blackhole_1_jmh l_blackhole1_1) throws Throwable {
    long operations = 0;
    long realTime = 0;
    result.startTime = System.nanoTime();
    do {
        benchmark.noop(); // <-- the original noop()
        operations++;
    } while (!control.isDone);
    result.stopTime = System.nanoTime();
    result.realTime = realTime;
    result.operations = operations;
}
So we have a while loop, spinning on the isDone flag and counting how many times we can manage to execute it until someone tells us to stop (by setting the isDone flag to true). It follows therefore that the measurement overhead is:
  • Reading the volatile field isDone (an L1-hitting, predictable read)
  • Incrementing a counter (on the stack)
But healthy skepticism is what this is all about, so let's see what the generated assembly looks like! I'll be gentle; assembly is often hard on the eyes.

Getting The Assembly Output

To try this at home you'll need a drink, a JVM set up to print assembly, and the sample code. Build the project with Maven, then run the benchmark and generate the assembly using the following command:
$JAVA_HOME/bin/java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*.noop_avgt_jmhLoop -XX:PrintAssemblyOptions=intel -XX:-UseCompressedOops -jar target/microbenchmarks.jar -i 5 -wi 5 -f 0 ".*.noop" > noop.ass
I'm only printing the measurement method, using the Intel syntax instead of the default AT&T, and disabling compressed oops to get simpler output for this particular exercise. The output will contain several versions of the compiled method; I will be discussing the final version, which is the last in the output.
Now that we have the assembly printed, we can get familiar with the structure of the JMH measurement loop as it is translated into assembly:
Decoding compiled method 0x00007fe5a106bb90:
Code:
[Entry Point]
[Constants]
# {method} 'noop_avgt_jmhLoop' '(Lorg/openjdk/jmh/runner/InfraControl;Lorg/openjdk/jmh/results/RawResults;Lpsy/lob/saw/generated/BaselineBenchmarks_noop$BaselineBenchmarks_1_jmh;Lpsy/lob/saw/generate
d/BaselineBenchmarks_noop$Blackhole_1_jmh;)V' in 'psy/lob/saw/generated/BaselineBenchmarks_noop'
# this: rsi:rsi = 'psy/lob/saw/generated/BaselineBenchmarks_noop'
# parm0: rdx:rdx = 'org/openjdk/jmh/runner/InfraControl'
# parm1: rcx:rcx = 'org/openjdk/jmh/results/RawResults'
# parm2: r8:r8 = 'psy/lob/saw/generated/BaselineBenchmarks_noop$BaselineBenchmarks_1_jmh'
# parm3: r9:r9 = 'psy/lob/saw/generated/BaselineBenchmarks_noop$Blackhole_1_jmh'
# [sp+0x20] (sp of caller)
0x00007fe5a106bce0: cmp rax,QWORD PTR [rsi+0x8]
0x00007fe5a106bce4: jne 0x00007fe5a1037960 ; {runtime_call}
0x00007fe5a106bcea: xchg ax,ax ; [NW] NOP
0x00007fe5a106bcec: nop DWORD PTR [rax+0x0] ; [NW] NOP, Align the function body
[Verified Entry Point]
0x00007fe5a106bcf0: mov DWORD PTR [rsp-0x14000],eax
0x00007fe5a106bcf7: push rbp
0x00007fe5a106bcf8: sub rsp,0x10 ;*synchronization entry
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@-1 (line 156)
0x00007fe5a106bcfc: mov r13,rdx ; [NW] control ref is now in R13
0x00007fe5a106bcff: mov rbp,r8 ; [NW] benchmark ref is now in RBP
0x00007fe5a106bd02: mov rbx,rcx ; [NW] result is now in RBX

This is just the preliminaries for the method, so not much to see except noting which reference is in which register to help interpret the rest of the code. The comments in the printout are generated by the JVM, my comments are prefixed with [NW].
Once all the pieces are in place we can move on to some actual work.

Measurement Loop: 2 Timestamps diverged in a yellow wood

Refresh your memory of what the Java code above does and let's see if we can find it here:
0x00007fe5a106bd05: mov r10,0x7fe5abd38f20 ; [NW] Setup the System.nanoTime() address for call
0x00007fe5a106bd0f: call r10 ;*invokestatic nanoTime
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@7 (line 158)
; [NW] RBX is result, RBX+0x20 is result.startTime
0x00007fe5a106bd12: mov QWORD PTR [rbx+0x20],rax ;*putfield startTime
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@10 (line 158)
; implicit exception: dispatches to 0x00007fe5a106bd81
0x00007fe5a106bd16: mov r11,rbp ; [NW] R11 = RBP which is l_baselinebenchmarks0_0
; [NW] the following 2 lines mean: "if (l_baselinebenchmarks0_0 == null) throw new NullPointerException();"
0x00007fe5a106bd19: test r11,r11 ; [NW] R11 & R11
0x00007fe5a106bd1c: je 0x00007fe5a106bd70 ;*invokevirtual noop
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@14 (line 160)
; [NW] R13 is control
0x00007fe5a106bd1e: movzx r10d,BYTE PTR [r13+0x9c] ;*getfield isDone
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@24 (line 162)
; implicit exception: dispatches to 0x00007fe5a106bd95
; [NW] EBP is the lower half of RBP, set it to 1
0x00007fe5a106bd26: mov ebp,0x1
; [NW] following 2 lines mean: "if (isDone == true) goto FINISH;"
0x00007fe5a106bd2b: test r10d,r10d
0x00007fe5a106bd2e: jne 0x00007fe5a106bd47 ;*aload_3
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@13 (line 160)
; [NW] LOOP START:
0x00007fe5a106bd30: movzx r10d,BYTE PTR [r13+0x9c] ;*getfield isDone
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@24 (line 162)
; [NW] RBP is operations, so do "operations++;"
0x00007fe5a106bd38: add rbp,0x1 ; OopMap{r11=Oop rbx=Oop r13=Oop off=92}
;*ifeq
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@27 (line 162)
; [NW] Safepoint!
0x00007fe5a106bd3c: test DWORD PTR [rip+0xb5482be],eax # 0x00007fe5ac5b4000
; {poll}
; [NW] following 2 lines mean: "if (isDone != true) goto LOOP START;"
0x00007fe5a106bd42: test r10d,r10d
0x00007fe5a106bd45: je 0x00007fe5a106bd30 ;*aload_2
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@30 (line 163)
; [NW] FINISH : Finished measuring
0x00007fe5a106bd47: mov r10,0x7fe5abd38f20 ; [NW] Setup the System.nanoTime() address for call
0x00007fe5a106bd51: call r10 ;*invokestatic nanoTime
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@31 (line 163)
; [NW] Populate the result fields
0x00007fe5a106bd54: mov QWORD PTR [rbx+0x10],rbp ;*putfield operations
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@46 (line 165)
0x00007fe5a106bd58: mov QWORD PTR [rbx+0x28],rax ;*putfield stopTime
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@34 (line 163)
; [NW] Note that realTime variable is gone, we just set the field to 0
0x00007fe5a106bd5c: mov QWORD PTR [rbx+0x18],0x0 ;*putfield realTime
; - psy.lob.saw.generated.BaselineBenchmarks_noop::noop_avgt_jmhLoop@40 (line 164)
; [NW] method wrap up mechanics, including another safepoint before returning
0x00007fe5a106bd64: add rsp,0x10
0x00007fe5a106bd68: pop rbp
0x00007fe5a106bd69: test DWORD PTR [rip+0xb548291],eax # 0x00007fe5ac5b4000
; {poll_return}
0x00007fe5a106bd6f: ret
Have a sip and scan slowly. Here are some nuggets to consider:

  • As expected the noop() method is not called and any mention of it is gone from the measurement loop.
  • The first iteration of the loop has been 'peeled'; this is common practice.
  • Even though we never call noop(), we still have to do the null check for the benchmark reference.
  • The sharp-eyed reader will have noticed the redundant realTime variable in the generated measurement loop; so has the JIT compiler, which replaced it with setting the result.realTime field directly to 0.
  • RBP is an 8-byte register and EBP is its lower half. Setting EBP to 1 in the peeled first iteration is the same as setting RBP to 1 (writes to EBP zero-extend into RBP).
  • The measurement loop includes a safepoint poll! Put that down as further measurement overhead.
This is the simplest benchmark one can write with JMH. On my test machine (an Intel Xeon E5-2697 v2 @ 2.70GHz) doing nothing is quite fast at 0.288 ns/op.
As you may have expected, reading the generated assembly is not so pleasant. I find the generated comments very helpful for orientation, and the timestamp calls on either side of the measurement loop help in zooming in on the important bits.

A Nano-Benchmark: i++

Nothing says "nano-benchmark" like benchmarking a single operation. Let's have a go at it!
int i;

@Benchmark
public void increment() {
    i++;
}
The generated loop is the same, but this time that crafty old JIT compiler cannot just do nothing with our code. We will finally learn the true cost of incrementing an integer! Given the overhead includes a long increment already I might even guess the cost at 0.25 ns/op, so maybe the result reported by JMH will be 0.5 ns/op? A warm fuzzy feeling of wisdom.
But when I run this benchmark on the same machine I learn to my dismay that incrementing an integer takes 1.794 ns/op according to my JMH benchmark. Damn integers! Why does the JVM torture us so with slow integer increments?
This is a silly benchmark, and the result makes absolutely no sense as an estimate of the cost of the ++ operator on integers. So what does it mean? Could it be that the JIT compiler failed us? Let's have a look at the assembly:
; [NW] START: R8 is benchmark, [r8+0x10] is benchmark.i
0x00007f1d35068640: inc DWORD PTR [r8+0x10] ;*invokevirtual increment
; - psy.lob.saw.generated.BaselineBenchmarks_increment::increment_avgt_jmhLoop@14 (line 160)
; [NW] R13 is control
0x00007f1d35068644: movzx r10d,BYTE PTR [r13+0x9c] ;*getfield isDone
; - psy.lob.saw.generated.BaselineBenchmarks_increment::increment_avgt_jmhLoop@24 (line 162)
; [NW] operations++
0x00007f1d3506864c: add rbp,0x1 ; OopMap{r8=Oop rbx=Oop r13=Oop off=112}
;*ifeq
; - psy.lob.saw.generated.BaselineBenchmarks_increment::increment_avgt_jmhLoop@27 (line 162)
; [NW] Safepoint
0x00007f1d35068650: test DWORD PTR [rip+0xb5119aa],eax # 0x00007f1d4057a000
; {poll}
; [NW] if(!isDone) goto START
0x00007f1d35068656: test r10d,r10d
0x00007f1d35068659: je 0x00007f1d35068640 ;*aload_2
; - psy.lob.saw.generated.BaselineBenchmarks_increment::increment_avgt_jmhLoop@30 (line 163)
So why is the reported cost so much higher than our expectation?

What just happened?

My increment method got translated perfectly into: "inc DWORD PTR [r8+0x10]". There is no compiler issue.  The comparison I made between incrementing the operations counter and incrementing the benchmark field is flawed/misguided/stupid/ignorant when taking into account the benchmark framework.
The context in which we increment operations is:
  • It's a long variable allocated on the stack
  • It's used in a very small method where there is no register pressure
  • It follows that operations is always a register
  • ADD/INC on a register costs very little (it's usually the cheapest thing you can do)
The context in which we increment benchmark.i is:
  • It's a field on the benchmark object
  • It's subject to happens-before rules so cannot be hoisted into a register inside the measurement loop (because control.isDone is a volatile read, see this post for more detail)
  • It follows that benchmark.i is always a memory location
  • INC on a memory location is not so cheap (by nano-benchmark standards); the sketch below contrasts the two cases
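To make the contrast concrete, here is a rough Java rendering of what the compiled measurement loop boils down to once increment() is inlined. This is my sketch based on the assembly above, not code JMH actually generates:
do {
    benchmark.i++;      // becomes "inc DWORD PTR [r8+0x10]" - a read-modify-write on memory
    operations++;       // becomes "add rbp,0x1"             - an increment of a register
} while (!control.isDone); // volatile read; keeps benchmark.i from being cached in a register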
Consulting the most excellent Agner Fog instruction tables tells me that on Ivy Bridge the latency of INC on memory is 6 cycles, while the latency of ADD on a register is 1. This indeed agrees to some extent with the cost reported by JMH (assuming 0.288 ns was one cycle, 0.288 * 6 = 1.728, which is pretty close to 1.794). But that's bad analysis. The truth is that cost is not additive, particularly where nano-benchmarks are concerned. In this case the cost of the INC seems to swallow up the baseline cost we measured before.
Is there something wrong with JMH? I don't think so. If we take the benchmark to be "an attempt at estimating the cost of calling a method which increments a field" then I would argue we got a valid answer. It's not the only answer however. Calling the same method in a context which allows further optimizations would yield a different answer.
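For instance, here is a hypothetical variant (my sketch, not from the original benchmarks) that calls the same method in a friendlier context. With no volatile read between the calls, the JIT is free to inline increment(), keep i in a register for the duration of the inner loop (or even collapse the additions), so the reported per-op cost would likely come out far lower:
@Benchmark
@OperationsPerInvocation(1000)
public int incrementInLoop() {
    for (int n = 0; n < 1000; n++) {
        increment(); // same method, but now hoistable/collapsible by the JIT
    }
    return i; // return the value so the work is not eliminated entirely
}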


Tuesday, 12 August 2014

The volatile read surprise

{UPDATE 03/09/14: If you come here looking for JMH related content start at the new and improved JMH Resources Page and branch out from there!}
On occasion, and for perfectly good reasons, I find myself trying to answer such deep existential questions as this one. Which is faster:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class LoopyBenchmarks {
    @Param({ "32", "1024", "32768" })
    int size;

    byte[] bunn;

    @Setup
    public void prepare() {
        bunn = new byte[size];
    }

    @Benchmark
    public void goodOldLoop(Blackhole fox) {
        for (int y = 0; y < bunn.length; y++) { // good old C style for (the win?)
            fox.consume(bunn[y]);
        }
    }

    @Benchmark
    public void sweetLoop(Blackhole fox) {
        for (byte bunny : bunn) { // syntactic sugar loop goodness
            fox.consume(bunny);
        }
    }
}

As you can see from the sample I turn to JMH to help me resolve such questions. If you know not what JMH is you may enjoy reading previous posts on the subject (start with this one). In short, it is a jolly awesome framework for benchmarking Java:
  • @Benchmark annotated methods will get benchmarked
  • The framework will pass in a Blackhole object that will pretend to 'consume' the values you pass into it and thus prevent the JIT compiler from dead code eliminating the above loops to nothing.
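For contrast, here is a sketch of mine (not part of the original benchmark class) of what happens if you skip the Blackhole: a loop whose loads are never consumed has no observable side effects, so the JIT compiler is free to eliminate it almost entirely and the score becomes meaningless:
@Benchmark
public void loopWithoutBlackhole() {
    for (int y = 0; y < bunn.length; y++) {
        byte b = bunn[y]; // value never used -> a prime candidate for dead code elimination
    }
}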
Assuming we are all on the same page with the benchmark snippet above, let the game begin!

Yummy yummy sugar!

So I ran the above benchmarks on a heavy-duty benchmarking machine and got the following results for different array sizes:
Benchmark     (size)   Score       Score error   Units
goodOldLoop       32      46.630         0.097   ns/op
goodOldLoop     1024    1199.338         0.705   ns/op
goodOldLoop    32768   37813.600        56.081   ns/op
sweetLoop         32      19.304         0.010   ns/op
sweetLoop       1024     475.141         1.227   ns/op
sweetLoop      32768   14295.800        36.071   ns/op
It sure looks like the syntactic sugar is much better! More than twice as fast! Awesome?

Must give us pause

At this point we could either:
  1. Declare syntactic sugar the clear winner and never write the old style for loops ever again 'cause they be slow like everything old! we hates them old loops! hates them!
  2. Worry that we are being a bit stupid
I get very little sleep and I was never very bright, so I'll go for 2. 
This benchmark result seems off; it's not what we expect. It would make sense for the JVM to make both loops the same, and yet they seem to work out very differently. Why, god? Whhhhhhhy?
The above benchmark is a tiny piece of code, and is a fine example of a nano-benchmark (to use the term coined by Shipilev for benchmarks of nano-second scale). These are pretty suspect benchmarks at the best of times, so you want to be quite alert when trying to make sense of them. When stuff doesn't make sense it is best to see what the JIT compiler made of your code and hit the assembly! Printing the JIT generated assembly is a neat party trick (sure to win you new friends and free drinks) and results in loads of funky text getting thrown at you. I was going to do a whole walk through the assembly but I have promises to keep and miles to walk before I sleep (some other time, I promise). So let's just skip to the WTF moment.

Into the hole

The assembly code for the goodOldLoop is long and painful to read through, and that in itself is a clue. Once you work out the control flow you'll sit there scratching your head and wondering. The thing that stands out (when the assembly smoke clears) is that bunn is loaded on every iteration, bunn.length is loaded and an array boundary check happens. This is surely a terrible way to interpret a for loop... 
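In Java terms, the compiled goodOldLoop behaves roughly like this (my rendering of the assembly, not actual source from the post):
for (int y = 0; y < this.bunn.length; y++) { // bunn and bunn.length re-read from memory every iteration
    byte b = this.bunn[y];                   // re-read again, bounds check repeated
    fox.consume(b);                          // the volatile reads inside consume() force the re-reads
}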
The culprit turns out to be a volatile read in Blackhole.consume:
//...
public volatile byte b1, b2;
public volatile BlackholeL2 nullBait = null;

/**
 * Consume object. This call provides a side effect preventing JIT to eliminate dependent computations.
 *
 * @param b object to consume.
 */
public final void consume(byte b) {
    if (b == b1 & b == b2) {
        // SHOULD NEVER HAPPEN
        nullBait.b1 = b; // implicit null pointer exception
    }
}

The above method ensures that a consumed value will not be subject to DCE even if it is completely predictable. The values of b1 and b2, being volatile, cannot be assumed to stay the same and so require re-examination. The side effect, however, is that we now have a volatile load in the midst of our for loop. A volatile load requires the JVM to perform all subsequent loads from memory in order to preserve happens-before relationships; in this case the field bunn is reloaded on every iteration of the loop. If bunn may have changed then its length may have also changed... sadness follows. To test this theory we can make a third loop:
@Benchmark
public void goodOldLoopReturns(Blackhole fox) {
    byte[] sunn = bunn; // make a local copy of the field
    for (int y = 0; y < sunn.length; y++) {
        fox.consume(sunn[y]);
    }
}

This performs much like the sweet syntactic sugar version:
Benchmark            (size)   Score       Score error   Units
goodOldLoopReturns       32      19.306         0.045   ns/op
goodOldLoopReturns     1024     476.493         1.190   ns/op
goodOldLoopReturns    32768   14292.286        16.046   ns/op
sweetLoop                32      19.304         0.010   ns/op
sweetLoop              1024     475.141         1.227   ns/op
sweetLoop             32768   14295.800        36.071   ns/op

Lessons learnt?

  • Nano-benchmarks and their results are hard to interpret. When in doubt read the assembly; when not in doubt, smack yourself to regain doubt and read the assembly. It's very easy for a phenomenon you are not looking to benchmark to slip into the benchmark.
  • Sugar is not necessarily bad for you. In the above case the syntactic sugar interpretation by the JVM was a better match to our intuition than the explicit old-school loop. By being explicit we inhibited optimisation, despite intending the same thing. The enhanced for loop, as the JLS calls it, is semantically different from the basic for loop in that it assumes some sort of snapshot iterator taken at the beginning of the loop and used throughout, which for primitive arrays means taking the form used in goodOldLoopReturns (see the sketch after this list).
  • Blackhole.consume is also a memory barrier, and these come with some side effects you may not expect. In larger benchmarks these may be negligible, but in nano-benchmarks every little thing counts. This is a fine use case for a 'weak' volatile read, one which requires a memory read but no memory barrier (see the previous post on the compound meaning of volatile access).
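Here is a sketch of the enhanced for loop desugaring for the primitive array case (my paraphrase of JLS 14.14.2): the array reference is read once, up front, and the loop runs against that local copy, just like goodOldLoopReturns:
byte[] a = bunn;                     // single read of the field
for (int y = 0; y < a.length; y++) { // length taken from the local copy
    byte bunny = a[y];
    fox.consume(bunny);
}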

Friday, 8 August 2014

The many meanings of volatile read and write

Just a quick note on the topic as I find I keep having this conversation. Volatile fields in Java provide three distinct features:
  1. Atomicity: volatile long and double fields are guaranteed to be atomically written. This is not otherwise the case for long and double. See JLS section 17.7 for more details. Also see this excellent argument made by Shipilev on why all fields could be made atomic with no significant downside.
  2. Store/Load to/from memory: a normal field load may get hoisted out of a loop and done just once; a volatile field is prevented from being optimized that way and will be loaded on each iteration. Similarly, volatile stores go to memory and will not be optimized away (a small sketch follows below).
  3. Global Ordering: A volatile write acts as a StoreLoad barrier thus preventing previous stores from being reordered with following loads. A volatile read acts as a LoadLoad barrier and prevents following loads from happening before it. This is opposed to the meaning of volatile in C/C++ where only other volatile loads/stores are prevented from reordering.
I would personally prefer to have these more refined tools at my disposal for when I need them, but volatile is a 3-in-1 sort of tool...
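To make point 2 concrete, here is a minimal sketch (mine, not from the post) of the classic spin-wait case: without volatile the read of done may be hoisted out of the loop and the spin never exits; with volatile it is re-loaded on every iteration:
class Spinner {
    volatile boolean done; // remove volatile and the JIT may hoist the read and spin forever

    void spinUntilDone() {
        while (!done) {
            // busy wait; each iteration performs a fresh load of 'done'
        }
    }
}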

What about AtomicLong.lazySet?

For those of you wondering (as I did) whether or not AtomicLong.lazySet (a.k.a. Unsafe.putOrderedLong) provides atomicity, it would seem the answer is yes. Digging through the JVM source code for the putOrderedLong intrinsic yields the following nugget:
bool LibraryCallKit::inline_unsafe_ordered_store(BasicType type) {
  // This is another variant of inline_unsafe_access, differing in
  // that it always issues store-store ("release") barrier and ensures
  // store-atomicity (which only matters for "long").
  /* ... all this unpleasant sea of nodes stuff ... not what I want to talk about ... */
  insert_mem_bar(Op_MemBarRelease);
  insert_mem_bar(Op_MemBarCPUOrder);
  // Ensure that the store is atomic for longs: <--- Yay!
  const bool require_atomic_access = true;
  Node* store;
  if (type == T_OBJECT) // reference stores need a store barrier.
    store = store_oop_to_unknown(control(), base, adr, adr_type, val, type, MemNode::release);
  else {
    store = store_to_memory(control(), adr, val, type, adr_type, MemNode::release, require_atomic_access);
  }
  insert_mem_bar(Op_MemBarCPUOrder);
  return true;
}
Look at that perfectly pleasant C++ code! The store is indeed made atomic. We can further test this observation by looking at the generated assembly for a 32 vs 64 bit JVM:
;A 64 bit JVM:
mov QWORD PTR [rsi+0x118],rcx ;*invokevirtual putOrderedLong
; - org.jctools.queues.InlinedCountersSpscConcurrentArrayQueue::tailLazySet@8 (line 131)
; - org.jctools.queues.InlinedCountersSpscConcurrentArrayQueue::offer@83 (line 163)
;A 32 bit JVM:
vmovd xmm0,ecx
vmovd xmm1,ebx
vpunpckldq xmm0,xmm0,xmm1
vmovsd QWORD PTR [esi+0x110],xmm0 ;*invokevirtual putOrderedLong
; - org.jctools.queues.InlinedCountersSpscConcurrentArrayQueue::tailLazySet@8 (line 131)
; - org.jctools.queues.InlinedCountersSpscConcurrentArrayQueue::offer@83 (line 163)
There you go! Atomicity is preserved! Hoorah!