Psychosomatic, Lobotomy, Saw: Intrinsics

Showing posts with label Intrinsics. Show all posts

Monday, 13 April 2015

On Arrays.fill, Intrinsics, SuperWord and SIMD instructions

{This post turned rather long, if you get lazy feel free to skip to the summary}
Let's start at the very beginning, a very good place to start... My very first post on this blog was a short rant on intrinsics, and how they ain't what they seem. In that post I made the following statement:

"intrinsic functions show up as normal methods or native methods"

Which is correct. An intrinsic function is applied as a method substitution. A method call will appear in the code and the compiler will replace it's apparent source-level implementation with a pre-cooked implementation. In some cases intrinsics are sort of compilation cheats, the idea being that some bits of functionality are both very important (i.e. worth while optimizing) and can benefit from a hand crafted solution that will be better than what the compiler can achieve. The end result can be in one of a few flavours:

Method call replaced with a call to a JVM runtime method: E.g. System.arrayCopy is replaced with a call to a method stub generated by the runtime for all array types. This method call is not a JNI call, but it is a static method call that is not inlined.
Method call replaced with one or more instructions inlined: E.g. Unsafe.getByte/compareAndSet/Math.max
Method call replaced with compiler IR implementation: E.g. java.lang.reflect.Array.getLength
A mix of the above: E.g. String.equals is partially implemented in IR, but the array comparison is a call to a method stub.

The intrinsics are all set up in vmSymbols.hpp and if you look, you'll see Arrays.fill is NOT on the list. So why am I talking about Chewbacca? Because it is something like an intrinsic...

The Arrays.fill SIMD Opportunity

Arrays.fill is the Java memset (fills an array with a given value), and just like System.arrayCopy (memcpy in C lingo) is worth the effort to optimize and offers the same kind of opportunity. What opportunity might that be, you ask? the opportunity to use SIMD (Single Instruction Multiple Data) instructions when the underlying CPU offers them (I assume for the sake of discussion AVX enabled CPUs i.e. since Sandy Bridge, I find this listing of intel intrinsics useful to explain and sort through the available instructions). These instructions allow the CPU to operate on up to 256 bit (512 bit soon) chunks of data, thus transforming 32 byte sized MOV instructions into a single wide MOV instruction (E.g. the intel C instrinsic _mm256_storeu_si256 or the corresponding instruction vmovdqu). SIMD instructions are good for all sorts of operations on vectors of data, or arrays, which is why the process of transforming element by element operations into SIMD instructions is also referred to as vectorization.
The actual assembly stub is generated dependent on CPU and available instruction set. For x86 the code is generated by the macroAssembler_x86.cpp, and the observant digger into the code will find it makes use of the widest memory instructions it can identify the processor is capable of. Wider is better baby! If you are not morbidly curious about what the implementation looks like, skip the next wall of assembly and you'll be back in Java land shortly.
Here's what the assembly boils down to when UseAVX>=2/UseSSE>=2/UseUnalignedLoadStores=true:

Roughly speaking the algorithm above is:

Fill up an XMM register with the intended value
Use the XMM register to write 64 byte chunks (2 vmovdqu) until no more are available
Write leftover 32 byte chunk (skipped if no matching leftovers)
Write leftover 8 byte chunks (skipped if no matching leftovers)
Write leftover 4 bytes (skipped if no matching leftovers)
Write leftover 2 bytes (skipped if no matching leftovers)
Write leftover 1 bytes (skipped if no matching leftovers)

It ain't nice, but we do what we gotta for performance! There are variations of the above described across the internets as the done thing for a memset implementation, this might seem complex but is pretty standard... anyway, moving right along.

The Arrays.fill 'intrinsic'

Arrays.fill is different from System.arrayCopy because, as it's absence from vmSymbols suggests, it's not a method substitution kind of intrinsic (so technically not an intrinsic). What is it then? Arrays.fill is a code pattern substitution kind of compiler shortcut, basically looking for this kind of loop:

And replacing it with a call into the JVM memset implementation (I recently learnt the same thing is done by GCC as well, see code to assembly here). The pattern matching bit is done in loopTransform.cpp. This feels enough like an intrinsic grey area that the method doing the pattern match and replace is called intrinsify_fill.
Pattern matching makes this optimization potentially far more powerful than method substitution as the programmer doesn't have to use a special JDK method to convey meaning, they can just express their meaning in code and the compiler 'knows' that this simple loop means 'fill'. Compare that with System.arrayCopy where rolling your own leads to performance that is much worse than that offered by the intrinsic.
Let's prove me right (my favourite thing, beats kittens and all that crap), here's a JMH (see the JMH reference page for more JMH info/examples) benchmark comparing Arrays.fill to a hand rolled fill, and System.arrayCopy to handrolled array copy:
And the results are (Oracle JDK8u40/i7-4770@3.40GHz/Ubuntu, array is 32K in size)?
ArrayFill.fillBytes 561.540 ± 10.814 ns/op
ArrayFill.manualFillBytes 557.901 ± 5.255 ns/op
ArrayFill.manualReversedFillBytes 1017.856 ± 0.425 ns/op
ArrayFill.copyBytes 1300.313 ± 13.482 ns/op
ArrayFill.manualCopyBytes 1477.442 ± 13.030 ns/op

We can verify that the call out to the JVM fill method happens for fillBytes/manualFillBytes by printing out the assembly:

So what have we learnt so far:

Use System.arrayCopy, it is better than your handrolled loop. But surprisingly not hugely better, hmmm.
You don't have to use Arrays.fill, you can roll your own and it works the same. Notice the call out to the fill method. But...
Don't get too creative rolling your own. If you get too funky (like filling the array backwards) it'll fall apart and the 'intrinsic' won't happen. But do note that the reverse fill still has some of that good SIMD stuff going, we'll get to that in a sec.

Are The Other Types Filling The Love?

It all sounds great don't it? Let's see how this pans out for other types. We'll be filling an array of 32KB. To be uniform across data types that means a 16K chars/shorts array, an 8K ints/floats array and a 4K array of longs. I added an 8K array of objects, which is the same size for compressed oops on the Oracle JVM (reference size is 4 bytes, same as an int).
The JMH benchmark code is as you'd expect:
Here's some reasonable expectations:

If no optimizations are present, wider writes are more efficient. It follows that the longFill would be the fastest. But...
Given a clever compiler the fill loop is replaced with the widest writes possible, so there should be no significant difference. But the fill optimization does not cover double/long/object arrays, so we might expect longFill to be the worst performer.
An objects array is not that different from an int array, so performance should be similar. Sure there's a write barrier, but it need only be done once per card (not once for the whole array as I thought initially, god bless Shipilev and PrintAssembly), so that's an extra byte write per card of elements filled. A card is per 512 bytes, each element is 4 bytes, so that's one card per 128 elements. Given there is no fill method implemented for it we may expect it to be slightly worse than the longFill.
We should not rely on expectations, because performance is best measured.

As you'd expect the results are somewhat different than the expectations (Oracle JDK8u40/i7-4770@3.40GHz/Ubuntu):

ArrayFill.fillBytes 561.540 ± 10.814 ns/op

ArrayFill.fillChars 541.901 ± 4.833 ns/op

ArrayFill.fillShorts 532.936 ± 4.508 ns/op

ArrayFill.fillInts 543.165 ± 3.788 ns/op

ArrayFill.fillFloats 537.599 ± 2.323 ns/op

ArrayFill.fillLongs 326.770 ± 3.267 ns/op

ArrayFill.fillDoubles 346.840 ± 5.786 ns/op

ArrayFill.fillObjects 4388.242 ± 11.945 ns/op

Say WOT?
For bytes/chars/shorts/ints/floats Arrays.fill performs very similarly. This much is as expected from the second point above. But filling an array of longs/doubles is better than the others. The funny thing is, there's no fill function implemented for the long array, how come it is so darn quick? Also, why does the objects fill suck quite so badly when compared with the rest (I will not be addressing this last question! I refuse! this post is too fucking long as it is!)?
This is what happens when we turn off the OptimizeFill flag:

ArrayFill.fillBytes 1013.803 ± 0.227 ns/op
ArrayFill.fillChars 323.806 ± 3.879 ns/op
ArrayFill.fillShorts 323.689 ± 4.499 ns/op
ArrayFill.fillInts 326.336 ± 1.559 ns/op
ArrayFill.fillFloats 319.449 ± 2.048 ns/op
ArrayFill.fillLongs 328.692 ± 3.282 ns/op
ArrayFill.fillDoubles 345.035 ± 6.362 ns/op
ArrayFill.fillObjects 4397.130 ± 7.161 ns/op

Strange innit? now we got char/int/long arrays all performing similarly. In fact, with the exception of the byte array, everything is better than it was with the optimization.

Superword to the rescue!

Turns out the JIT compiler is clued up on the topic of SIMD parallelisation by way of Superword Level Parallelism (see the original paper here):

In some respects, superword level parallelism is a restricted form of ILP (Instruction Level Parallelism). ILP techniques have been very successful in the general purpose computing arena, partly because of their ability to find parallelism within basic blocks. In the same way that loop unrolling translates loop level parallelism into ILP, vector parallelism can be transformed into SLP. This realization allows for the parallelization of vectorizable loops using the same basic block analysis. As a result, our algorithm does not require any of the complicated loop transformations typically associated with vectorization. In fact, vector parallelism alone can be uncovered using a simplified version of the SLP compiler algorithm.
...
Superword level parallelism is defined as short SIMD parallelism in which the source and result operands of a SIMD operation are packed in a storage location.
...
Vector parallelism is a subset of superword level parallelism.

The Hotspot compiler implements SLP optimizations in superword.cpp and you are invited to dive into the implementation if you like. I'm going to focus on it's impact here, and to do that I only need to know how to turn it on and off (core competency for any software person). It's on by default, so above results are what happens when it is on, here's what life looks like when it is off too (so -XX:-OptimizeFill -XX:-UseSuperWord):
ArrayFill.fillBytes 8501.270 ± 2.896 ns/op
ArrayFill.fillChars 4286.161 ± 4.935 ns/op
ArrayFill.fillShorts 4286.168 ± 3.146 ns/op
ArrayFill.fillInts 2152.489 ± 2.653 ns/op
ArrayFill.fillFloats 2140.649 ± 2.587 ns/op
ArrayFill.fillLongs 1105.926 ± 2.228 ns/op
ArrayFill.fillDoubles 1105.820 ± 2.393 ns/op
ArrayFill.fillObjects 4392.506 ± 11.678 ns/op

Life is revealed in all it's sucky splendour! This is what happens when the compiler shows you no love... did I say no love? hang on, things can get a bit worse.

Detour: Unsafe? We don't serve you kind here

To all the Unsafe fans, I got some sad news for y'all. Unsafe 'fill' loops are not well loved by the compiler. This is the price of stepping off the beaten path I guess. Consider the following benchmark:
The results are:
ArrayFill.unsafeFillOffheapBytes 9742.621 ± 2.270 ns/op
ArrayFill.unsafeFillOnHeapBytes 12640.019 ± 1.977 ns/op
ArrayFill.fillBytes(for reference) 561.540 ± 10.814 ns/op
The Unsafe variant do not enjoy the 'fill' pattern matching magic, nor do they get the SuperWord optimizations. What can you do? For this kind of thing you should use the Unsafe.setMemory method instead:
With the result:
ArrayFill.unsafeSetOffheapBytes 1259.281 ± 21.294 ns/op
ArrayFill.unsafeSetOnHeapBytes 1275.158 ± 27.950 ns/op
Not quite there, still ~2x worse (why? how come it doesn't just call the bytes fill method? a bit of digging shows it ends up calling the underlying platform's memset...) but beats being 20-25x worse like the handrolled method is.

Summary and Musings

It's the circle of life!

So what did we learn:

There's another kind of 'intrinsic' like optimization, which uses pattern matching to swap a block of code rather than a method. This is employed for memset like memory fill loops (in particular Arrays.fill) intrinsicfication. It's not an intrinsic technically, but you know what I fucking mean.
System.arrayCopy/Arrays.fill implementations utilize SIMD instructions to improve their efficiency. These instructions are not available in plain Java, so some compiler intervention is required.
The JIT compiler is also able to use SuperWord Level Parallelism to derive SIMD code from 'normal' sequential code.
In the case of Arrays.fill, it looks like the SuperWord optimized code is faster than the fill specialized implementation for all types except bytes (on the system under test)
If you use Unsafe you will be excluded from these optimizations.

So I look at this process and I imagine history went something like this:

We want to use SIMD instructions, but the JIT compiler isn't really clever enough to generate them by itself. Memset implementations are rather specialized after all. Let's make life a bit easier for the compiler by creating an intrinsic. We'll even go the extra mile and make an effort to automatically identify opportunities to use this intrinsic, so now it's not really an intrinsic any more. The Arrays.fill optimization is available on Oracle JDK6u45 (the oldest I keep around, maybe it was there a while before that) and on that JVM it is twice as fast as the SLP generated code.

Over time, SLP gets better and eventually the compiler is now good enough to optimize the fill loop by itself and beat the specialized method. That is an awesome thing. We just need to remove the training wheels now.

And there's a final punch line to this story. Memset/Memcpy are such common and important opportunities for optimization, so Intel has decided to offer an assembly 'recipe' for them and save everyone the effort in writing them:

3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB)
Beginning with processors based on Intel microarchitecture code name Ivy Bridge, REP string operation using MOVSB and STOSB can provide both flexible and high-performance REP string operations for soft- ware in common situations like memory copy and set operations. Processors that provide enhanced MOVSB/STOSB operations are enumerated by the CPUID feature flag: CPUID:(EAX=7H, ECX=0H):EBX.ERMSB[bit 9] = 1. - [From the Intel Optimization Manual(September 2014)]

From the manual it seems that this method of implementing memcpy/memset can perform well, but like anything else, YMMV (the intel manual discussion of the performance differences is in itself interesting both on the results and the methodology level). One obvious advantage of this method is that it results in much much smaller code that should be trivial to inline into callers. This will however put the SuperWord method at a slight disadvantage, and the tide will change again.
[UPDATE 14/03/2015: It seems the good folks of Oracle have considered and rejected the use of REP MOVSB for array copy.]
Thanks go to the kind reviewers Peter 'Massive' Hughes, Darrach and the Shipster

Tuesday, 6 August 2013

JMM Cookbook Footnote: NOOP Memory Barriers on x86 are NOT FREE

In Doug Lea's excellent JMM Cookbook the following observation is made about memory barriers on different architectures:

Processor	LoadStore	LoadLoad	StoreStore	StoreLoad	Data dependency orders loads?	Atomic Conditional	Other Atomics	Atomics provide barrier?
sparc-TSO	no-op	no-op	no-op	membar (StoreLoad)	yes	CAS: casa	swap,ldstub	full
x86	no-op	no-op	no-op	mfence or cpuid or locked insn	yes	CAS: cmpxchg	xchg, locked insn	full
ia64	combine with st.rel or ld.acq	ld.acq	st.rel	mf	yes	CAS: cmpxchg	xchg, fetchadd	target + acq/rel
arm	dmb (see below)	dmb (see below)	dmb-st	dmb	indirection only	LL/SC: ldrex/strex		target only

This leads quite a few people to conclude that there is no performance cost to the memory barriers that are marked no-op, and on occasion leads people to believe they can just not put them in at all. After all, what is the point in a no-op? surely a program with the no-ops removed is the same program, right? Wrong...
After having this debate again recently I thought I'd document it here to avoid repetition, and to get some second opinions I got some help from the good people on the concurrency-interest mailing list.

What's a Memory Barrier?

This is Noop

From the Cookbook:

"Compilers and processors must both obey reordering rules. No particular effort is required to ensure that uniprocessors maintain proper ordering, since they all guarantee "as-if-sequential" consistency. But on multiprocessors, guaranteeing conformance often requires emitting barrier instructions...
...Memory barrier instructions directly control only the interaction of a CPU with its cache, with its write-buffer that holds stores waiting to be flushed to memory, and/or its buffer of waiting loads or speculatively executed instructions."

We need memory barriers to help the compiler and CPU make sense of our intentions in terms of concurrency. We need to hold them back from optimizing away code that can be eliminated or re-ordered if optimal single threaded execution is the only consideration. The last bit suggests that barriers are really about memory ordering.

Why isn't it free?

Volatile reads (LoadLoad) and lazySet/putOrdered (StoreStore) are both "No-op"s on x86. Mike Barker (LMAX dude, maintainer and contributor to the Disruptor, here's his blog) replied to my request for clarification:

"The place to look (as definitive as it gets for the hardware) is section 8.2.2, volume 3A of the Intel programmer manual. It lists the rules that are applied regarding the reordering of instructions under the X86 memory model. I've summarised the relevant ones here:
- Reads are not reordered with other reads.
- Writes are not reordered with older reads.
- Writes to memory are not reordered with other writes.
- Reads may be reordered with older writes to different locations but not with older writes to the same location.
The only reordering that will occur with x86 is allowing reads to be executed before writes (to other locations), hence the need for a LOCKed instruction to enforce the store/load barrier. As you can see with the above rules, store are not reordered with older stores and loads are not reordered with older loads so a series of MOV instructions is sufficient for a store/store or a load/load barrier."

So since a volatile read and a lazySet both boil down to a MOV from/to memory, no further special instruction is required (hence no-op) as the underlying architecture memory model obeys the rules required by the JMM (Note: lazySet is not in the JMM, but is as good as in it, I go into the origins of lazySet here). But while that does mean these barriers are cheap it doesn't make them free. The above rules Mike is quoting are for the generated assembly, but the barriers impact the compiler as well. Memory barriers are there to stop your code being interpreted a particular way by both the compiler and the CPU. The JMM guarantees that memory barriers are respected by BOTH. This is important because the JMM Cookbook is talking about CPU no-ops, not compiler no-ops. Vitaly Davidovich from concurrency interest replied:

"Yes, volatile loads and lazySet do not cause any cpu fence/barrier instructions to be generated - in that sense, they're noop at the hardware level. However, they are also compiler barriers, which is where the "cheap but ain't free" phrase may apply. The compiler cannot reorder these instructions in ways that violate their documented/spec'd memory ordering effects. So for example, a plain store followed by lazySet cannot actually be moved after the lazySet; whereas if you have two plain stores, the compiler can technically reorder them as it sees fit (if we look at just them two and disregard other surrounding code). So, it may happen that compiler cannot do certain code motion/optimizations due to these compiler fences and therefore you have some penalty vs using plain load and stores. For volatile loads, compiler cannot enregister the value like it would with plain load, but even this may not have noticeable perf diff if the data is in L1 dcache, for example."

The difference in performance between using a plain write and a lazySet has been demonstrated in this post if you look at the difference between the putOrdered and plain writes versions. I expect this difference is far more pronounced in a benchmark than in a full blown system, but the point is it's a demonstrable difference. Similarly here Mr. Brooker demonstrates there is a difference between normal reads and volatile reads. There's a flaw in the experiment in that the volatile writes are making the volatile reads slower, but his point is valid as a demonstration of inhibited optimization (loop is not unrolled, value is not enregistered) and the uncontended read tells that story well.

So... I can't just drop it out of my program?

The TSO model for x86 dictates how your code will behave, if you were writing assembly. If you are writing Java, you are at the mercy of the JIT compiler first, and only then the CPU. If you fail to inform the compiler about your intentions, your instructions may never hit the CPU, or may get there out of order. Your contract is with the JMM first.
It's like punctuation: "helping your uncle Jack off a horse" != "helping your uncle jack off a horse" => punctuation is not a no-op ;-)

Thursday, 25 October 2012

Java Intrinsics are not JNI calls

That is a fact well known to those who know it well... but sadly some confusion still abounds, here's another go of explaining the differences.
An intrinsic is well defined on Wikipedia, to summarize: it is a function 'macro' to be handled by the compiler. The JIT compiler supports a large list of such intrinsic function macros . The beauty of the 2 concepts joined together(intrinsic functions and JIT compilation optimizations) is that the JIT compiler can optimize whole functions into single processor instructions, for the particular processor detected at runtime, and get a great performance boost.
Where it all gets a bit confusing is that the intrinsic functions show up as normal methods or native methods when browsing the source code. Regardless of what they look like (native/Java) they will magically be transformed to far more performant alternative when picked up by the JIT (Caution must be taken with this piece of advice as the JIT compiler behavior can change from JVM to JVM implementation and turn your crafty choice of functions from a speedy chariot to a soggy pumpkin... one such important JVM is the Android Dalvik)
To get an idea of the range of functions which benefit from this nifty trick you can check out this list on this Java gaming Wiki, or have a look in this header file where you can find a more definitive list(look for do_intrinsic) and also get a view on how these things hang together. Some classes/methods on the list:

The wonderful Unsafe. Almost all intrinsics.
Math: abs(double); sin(double); cos(double); tan(double); atan2(double, double); sqrt(double); log(double); log10(double); pow(double, double); exp(double); min(int, int); max(int, int);
System: identityHashCode(Object); currentTimeMillis(); nanoTime(); arraycopy(....);
And many many more...

See Mike Barker's detailed explanation here which is really much more in depth then mine. He did everything I meant to do when I set out to write this little post, so really you must check it out.
One take away from this is that looking through the code is not enough to reason about the performance. In some cases where the performance is surprisingly good you will find an intrinsic standing behind that little boost you didn't see coming.