Psychosomatic, Lobotomy, Saw: The volatile read surprise

{UPDATE 03/09/14: If you come here looking for JMH related content start at the new and improved JMH Resources Page and branch out from there!}
On occasion, and for perfectly good reasons, I find myself trying to answer such deep existential questions as this one. Which is faster:

	import java.util.concurrent.TimeUnit;

	import org.openjdk.jmh.annotations.*;
	import org.openjdk.jmh.infra.Blackhole;

	@BenchmarkMode(Mode.AverageTime)
	@OutputTimeUnit(TimeUnit.NANOSECONDS)
	@State(Scope.Thread)
	public class LoopyBenchmarks {
	    @Param({ "32", "1024", "32768" })
	    int size;

	    byte[] bunn;

	    @Setup
	    public void prepare() {
	        bunn = new byte[size];
	    }

	    @Benchmark
	    public void goodOldLoop(Blackhole fox) {
	        for (int y = 0; y < bunn.length; y++) { // good old C style for (the win?)
	            fox.consume(bunn[y]);
	        }
	    }

	    @Benchmark
	    public void sweetLoop(Blackhole fox) {
	        for (byte bunny : bunn) { // syntactic sugar loop goodness
	            fox.consume(bunny);
	        }
	    }
	}

As you can see from the sample, I turn to JMH to help me resolve such questions. If you know not what JMH is you may enjoy reading previous posts on the subject (start with this one). In short, it is a jolly awesome framework for benchmarking Java:

@Benchmark annotated methods will get benchmarked
The framework will pass in a Blackhole object that will pretend to 'consume' the values you pass into it and thus prevent the JIT compiler from dead code eliminating the above loops to nothing.

Assuming we are all on the same page with this snippet above, let the game begin!

Yummy yummy sugar!

So I ran the above benchmarks on some heavy duty benchmarking machine and got the following results for different array sizes:

	Benchmark    (size)      Score  Score error  Units
	goodOldLoop      32     46.630        0.097  ns/op
	goodOldLoop    1024   1199.338        0.705  ns/op
	goodOldLoop   32768  37813.600       56.081  ns/op
	sweetLoop        32     19.304        0.010  ns/op
	sweetLoop      1024    475.141        1.227  ns/op
	sweetLoop     32768  14295.800       36.071  ns/op

It sure looks like the syntactic sugar is much better! More than twice as fast! Awesome?

Must give us pause

At this point we could either:

1. Declare syntactic sugar the clear winner and never write the old style for loops ever again 'cause they be slow like everything old! We hates them old loops! Hates them!
2. Worry that we are being a bit stupid

I get very little sleep and I was never very bright, so I'll go for 2.

This benchmark result seems off, it's not what we expect. It would make sense for the JVM to make both loops the same, and yet they seem to work out very differently. Why, god? whhhhhhhy?

The above benchmark is a tiny piece of code, and is a fine example of a nano-benchmark (to use the term coined by Shipilev for benchmarks of nanosecond scale). These are pretty suspect benchmarks at the best of times, so you want to be quite alert when trying to make sense of them. When stuff doesn't make sense it is best to see what the JIT compiler made of your code and hit the assembly! Printing the JIT generated assembly is a neat party trick (sure to win you new friends and free drinks) and results in loads of funky text getting thrown at you. I was going to do a whole walk through the assembly but I have promises to keep and miles to walk before I sleep (some other time, I promise). So let's just skip to the WTF moment.

Into the hole

The assembly code for the goodOldLoop is long and painful to read through, and that in itself is a clue. Once you work out the control flow you'll sit there scratching your head and wondering. The thing that stands out (when the assembly smoke clears) is that bunn is loaded on every iteration, bunn.length is loaded and an array boundary check happens. This is surely a terrible way to interpret a for loop...

The culprit turns out to be a volatile read in Blackhole.consume:

	//...
	public volatile byte b1, b2;
	public volatile BlackholeL2 nullBait = null;

	/**
	 * Consume object. This call provides a side effect preventing JIT to eliminate dependent computations.
	 *
	 * @param b object to consume.
	 */
	public final void consume(byte b) {
	    if (b == b1 & b == b2) {
	        // SHOULD NEVER HAPPEN
	        nullBait.b1 = b; // implicit null pointer exception
	    }
	}

The above method ensures that a consumed value will not be subject to DCE even if it is completely predictable. Because b1 and b2 are volatile, the JIT cannot assume they stay the same, so they must be re-read on every call. The side effect, however, is that we now have a volatile load in the midst of our for loop. To preserve happens-before relationships, the JVM may not reuse values it loaded before a volatile load for the loads that follow it, so those must be performed again from memory; in this case the field bunn is reloaded on every iteration of the loop. If bunn may have changed then its length may have also changed... sadness follows. To test this theory we can make a third loop:

	@Benchmark
	public void goodOldLoopReturns(Blackhole fox) {
	    byte[] sunn = bunn; // make a local copy of the field
	    for (int y = 0; y < sunn.length; y++) {
	        fox.consume(sunn[y]);
	    }
	}

This performs much like the sweet syntactic sugar version:

	Benchmark           (size)      Score  Score error  Units
	goodOldLoopReturns      32     19.306        0.045  ns/op
	goodOldLoopReturns    1024    476.493        1.190  ns/op
	goodOldLoopReturns   32768  14292.286       16.046  ns/op
	sweetLoop               32     19.304        0.010  ns/op
	sweetLoop             1024    475.141        1.227  ns/op
	sweetLoop            32768  14295.800       36.071  ns/op
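
The difference between looping on the field and looping on a local copy is not only a matter of speed; it is observable in plain Java. The sketch below uses no JMH and measures nothing. It swaps the array in from inside the loop body, on the same thread, as a stand-in for the kind of change the JIT has to allow for after a volatile read. The class and method names are made up for illustration; only the loop shapes come from the benchmarks above.

```java
/** Illustrative only: class and method names are made up for this sketch. */
final class SnapshotLoops {
    byte[] bunn = new byte[8];

    /** Basic for loop on the field: bunn and bunn.length are read again on each pass. */
    int fieldLoop() {
        int iterations = 0;
        for (int y = 0; y < bunn.length; y++) {
            iterations++;
            bunn = new byte[2]; // swap in a shorter array mid-loop
        }
        return iterations;
    }

    /** Enhanced for loop: iterates the array that bunn referred to when the loop started. */
    int enhancedLoop() {
        int iterations = 0;
        for (byte ignored : bunn) {
            iterations++;
            bunn = new byte[2]; // the loop keeps walking the original array
        }
        return iterations;
    }

    /** Basic for loop on a local copy, the shape used by goodOldLoopReturns. */
    int localCopyLoop() {
        byte[] sunn = bunn; // snapshot of the field
        int iterations = 0;
        for (int y = 0; y < sunn.length; y++) {
            iterations++;
            bunn = new byte[2]; // has no effect on sunn
        }
        return iterations;
    }

    public static void main(String[] args) {
        System.out.println(new SnapshotLoops().fieldLoop());     // 2
        System.out.println(new SnapshotLoops().enhancedLoop());  // 8
        System.out.println(new SnapshotLoops().localCopyLoop()); // 8
    }
}
```

The field loop stops after two iterations because it sees the new, shorter array, while the other two finish all eight. A loop that is allowed to notice such a change cannot keep bunn and its length in registers once the compiler has to assume the field may have been updated.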

Lessons learnt?

Nano benchmarks and their results are hard to interpret. When in doubt read the assembly, when not in doubt smack yourself to regain doubt and read the assembly. It's very easy for a phenomenon you are not looking to benchmark to slip into the benchmark.
Sugar is not necessarily bad for you. In the above case the syntactic sugar interpretation by the JVM was a better match to our intuition than the explicit old school loop. By being explicit we inhibited optimisation, despite intending the same thing. The enhanced for loop, as the JLS calls it, is semantically different from the basic for loop in that it assumes some sort of snapshot iterator taken at the beginning of the loop and used throughout, which for primitive arrays means taking the form used in goodOldLoopReturns.
Blackhole.consume is also a memory barrier, and these come with some side effects you may not expect. In larger benchmarks these may be negligible, but in nano benchmarks every little thing counts. This is a fine use case for a 'weak' volatile read, one which requires a memory read but no memory barrier (see the previous post on the compound meaning of the volatile access).
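
For the curious, here is a rough sketch of what such a 'weak' read could look like with the VarHandle API (Java 9 and later, so newer than this post). This is not how JMH implements Blackhole; OpaqueBlackhole, its fields and consumeAll are hypothetical names used only here. The assumption is that opaque access mode fits the description: the read must actually be performed each time, but it imposes no ordering on accesses to other variables. Whether the JIT then keeps the array in a register is not measured here; check the assembly before believing it.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

/** Hypothetical stand-in for the byte path of a blackhole, using opaque reads. */
final class OpaqueBlackhole {
    private static final VarHandle B1;
    private static final VarHandle B2;

    static {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            B1 = lookup.findVarHandle(OpaqueBlackhole.class, "b1", byte.class);
            B2 = lookup.findVarHandle(OpaqueBlackhole.class, "b2", byte.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Plain (non-volatile) fields; consume reads them only through the VarHandles.
    private byte b1 = 1;
    private byte b2 = 2; // differs from b1, so the branch in consume never fires

    /** Compares b against two opaquely read fields so the value counts as used. */
    void consume(byte b) {
        byte x = (byte) B1.getOpaque(this); // read is performed, no ordering implied
        byte y = (byte) B2.getOpaque(this);
        if (b == x & b == y) {
            throw new IllegalStateException("should never happen");
        }
    }

    /** Feeds every element of data to the blackhole and returns how many were consumed. */
    static int consumeAll(byte[] data, OpaqueBlackhole hole) {
        int consumed = 0;
        for (byte b : data) {
            hole.consume(b);
            consumed++;
        }
        return consumed;
    }
}
```

A call such as OpaqueBlackhole.consumeAll(new byte[32], new OpaqueBlackhole()) walks the whole array without ever taking the branch, since no byte can equal both 1 and 2.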

3 comments:

maaartinus, 4 Sept 2014, 07:35:00
I guess, there's still a considerable overhead for using the BlackHole as it looks like you need 2 cycles per iteration. Computing a sum of the bytes should need only one cycle per iteration. With manual loop unrolling and using two accumulators you may be able to get below one cycle per iteration. OTOH I assume you're mostly interested in volatile semantics and this is unrelated.

Anonymous, 13 Feb 2015, 18:02:00
Mh, I did not get it. A volatile write happens-before a volatile read of the same variable. But here we only have volatile reads and non-volatile accesses on the array etc. Why does the jvm need to load all subsequent non-volatile variables from memory, too? And why subsequent although it is called happens-before?

Tuesday, 12 August 2014
