Tuesday, 12 August 2014

The volatile read suprise

{UPDATE 03/09/14: If you come here looking for JMH related content start at the new and improved JMH Resources Page and branch out from there!}
On occasion, and for perfectly good reasons, I find myself trying to answer such deep existential questions as this one. Which is faster:
As you can see from the sample I turn to JMH to help me resolve such questions. If you know not what JMH is you may enjoy reading previous posts on the subject (start with this one). In short it is a jolly awesome framework for benchmarking java:
  • @Benchmark annotated methods will get benchmarked
  • The framework will pass in a Blackhole object that will pretend to 'consume' the values you pass into it and thus prevent the JIT compiler from dead code eliminating the above loops to nothing.
Assuming we are all on the same page with this snippet above, let the game begin!

Yummy yummy sugar!

So I ran the above benchmarks on some heavy duty benchmarking machine and get the following results for different array sizes:
It sure looks like that syntactic sugar is much better! more than twice as fast! awesome?

Must give us pause

At this point we could either:
  1. Declare syntactic sugar the clear winner and never write the old style for loops ever again 'cause they be slow like everything old! we hates them old loops! hates them!
  2. Worry that we are being a bit stupid
I get very little sleep and I was never very bright, so I'll go for 2. 
This benchmark result seems off, it's not what we expect. It would make sense for the JVM to make both loops the same, and yet they seem to work out very differently. Why, god? whhhhhhhy?
The above benchmark is a tiny piece of code, and is a fine example of a nano-benchmark (to use the term coined by Shipilev for benchmarks of nano-second scale). These are pretty suspect benchmarks at the best of time so you want to be quite alert when trying to make sense of them. When stuff doesn't make sense it is best to see what the JIT compiler made of your code and hit the assembly! Printing the JIT generated assembly is a neat party trick (sure to win you new friends and free drinks) and results in loads of funky text getting thrown at you. I was going to do a whole walk through the assembly but I have promises to keep and miles to walk before I sleep (some other time, I promise). So lets just skip to the WTF moment.

Into the hole

The assembly code for the goodOldLoop is long and painful to read through, and that in itself is a clue. Once you work out the control flow you'll sit there scratching your head and wondering. The thing that stands out (when the assembly smoke clears) is that bunn is loaded on every iteration, bunn.length is loaded and an array boundary check happens. This is surely a terrible way to interpret a for loop... 
The culprit turns out to be a volatile read in Blackhole.consume:

The above method ensures that a consumed value will not be subject to DCE even if it is completely predictable. The values for b1, b2 being volatile cannot be assumed to stay the same and so require re-examination. The side effect is however that we now have a volatile load in the midst of our for loop. A volatile load of one value requires the JVM to load all subsequent loads from memory to force happens before relationships, in this case the field bunn is reloaded on every iteration of the loop. If bunn may have changed then it's length may have also changed... sadness follows. To test this theory we can make a third loop:
This performs much like the sweet syntactic sugar version:

Lessons learnt?

  • Nano benchmarks and their results are hard to interpret. When in doubt read the assembly, when not in doubt smack yourself to regain doubt and read the assembly. It's very easy for a phenomena you are not looking to benchmark to slip into the benchmark.
  • Sugar is not necessarily bad for you. In the above case the syntactic sugar interpretation by the JVM was a better match to our intuition than the explicit old school loop. By being explicit we inhibited optimisation, despite intending the same thing. The enhanced for loop, as the JLS calls it, is semantically different from the basic for loop in that it assumes some sort of snapshot iterator taken at the beginning of the loop and used throughout, which for primitive arrays means taking the form used in goodOldLoopReturns.
  • Blackhole.consume is also a memory barrier, and these come with some side effects you may not expect. In larger benchmarks these may be negligible but in nano benchmarks every little thing counts. This is a fine use case for a 'weak' volatile read, one which requires a memory read but no memory barrier(previous post on the compound meaning of the volatile access)

3 comments:

  1. I guess, there's still a considerable overhead for using the BlackHole as it looks like you need 2 cycles per iteration. Computing a sum of the bytes should need only one cycle per iteration. With manual loop unrolling and using two accumulators you may be able to get below one cycle per iteration. OTOH I assume you're mostly interested in volatile semantics and this is unrelated.

    ReplyDelete
  2. Mh, I did not get it. A volatile write happens-before a volatile read of the same variable. But here we only have volatile reads and non-volatile accesses on the array etc. Why does the jvm need to load all subsequent non-volatile variables from memory, too? And why subsequent although it is called happens-before?

    ReplyDelete
    Replies
    1. And that, my friend, is why it's called a 'surprise'! ;-)
      Volatile reads/writes have ordering effects for ALL loads and stores.
      To support HB a volatile read forces all following loads to be from memory. Otherwise imagine:
      int a;
      volatile int v;
      T1: a = 1; v = 1;
      T2: r1 = a; if(v==1) assert a==1;
      now consider there's an undefined amount of time between r1 = a; and the assertion. The assertion cannot fail according to JMM.

      Delete