Let's start at the very beginning, a very good place to start... My very first post on this blog was a short rant on intrinsics, and how they ain't what they seem. In that post I made the following statement:
"intrinsic functions show up as normal methods or native methods"Which is correct. An intrinsic function is applied as a method substitution. A method call will appear in the code and the compiler will replace it's apparent source-level implementation with a pre-cooked implementation. In some cases intrinsics are sort of compilation cheats, the idea being that some bits of functionality are both very important (i.e. worth while optimizing) and can benefit from a hand crafted solution that will be better than what the compiler can achieve. The end result can be in one of a few flavours:
- Method call replaced with a call to a JVM runtime method: E.g. System.arrayCopy is replaced with a call to a method stub generated by the runtime for all array types. This method call is not a JNI call, but it is a static method call that is not inlined.
- Method call replaced with one or more instructions inlined: E.g. Unsafe.getByte/compareAndSet/Math.max
- Method call replaced with compiler IR implementation: E.g. java.lang.reflect.Array.getLength
- A mix of the above: E.g. String.equals is partially implemented in IR, but the array comparison is a call to a method stub.
The Arrays.fill SIMD Opportunity
Arrays.fill is the Java memset (fills an array with a given value), and just like System.arrayCopy (memcpy in C lingo) is worth the effort to optimize and offers the same kind of opportunity. What opportunity might that be, you ask? the opportunity to use SIMD (Single Instruction Multiple Data) instructions when the underlying CPU offers them (I assume for the sake of discussion AVX enabled CPUs i.e. since Sandy Bridge, I find this listing of intel intrinsics useful to explain and sort through the available instructions). These instructions allow the CPU to operate on up to 256 bit (512 bit soon) chunks of data, thus transforming 32 byte sized MOV instructions into a single wide MOV instruction (E.g. the intel C instrinsic _mm256_storeu_si256 or the corresponding instruction vmovdqu). SIMD instructions are good for all sorts of operations on vectors of data, or arrays, which is why the process of transforming element by element operations into SIMD instructions is also referred to as vectorization.
The actual assembly stub is generated dependent on CPU and available instruction set. For x86 the code is generated by the macroAssembler_x86.cpp, and the observant digger into the code will find it makes use of the widest memory instructions it can identify the processor is capable of. Wider is better baby! If you are not morbidly curious about what the implementation looks like, skip the next wall of assembly and you'll be back in Java land shortly.
Here's what the assembly boils down to when UseAVX>=2/UseSSE>=2/UseUnalignedLoadStores=true:
The actual assembly stub is generated dependent on CPU and available instruction set. For x86 the code is generated by the macroAssembler_x86.cpp, and the observant digger into the code will find it makes use of the widest memory instructions it can identify the processor is capable of. Wider is better baby! If you are not morbidly curious about what the implementation looks like, skip the next wall of assembly and you'll be back in Java land shortly.
Here's what the assembly boils down to when UseAVX>=2/UseSSE>=2/UseUnalignedLoadStores=true:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
// This is a hacked up version of the original to be found here: | |
// http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/61746b5f0ed3/src/cpu/x86/vm/macroAssembler_x86.cpp#l6085 | |
// This code is offered under the same licence as the original if | |
// you are fool enough to cut and paste it. | |
void MacroAssembler::generate_fill(BasicType t, bool aligned, | |
Register to, Register value, Register count, | |
Register rtmp, XMMRegister xtmp) { | |
Label L_exit, L_skip_align1, L_skip_align2, L_fill_byte, L_fill_2_bytes | |
Label L_fill_4_bytes, L_fill_32_bytes, L_fill_32_bytes_loop; | |
Label L_check_fill_8_bytes, L_fill_8_bytes_loop, L_fill_8_bytes; | |
Label L_fill_64_bytes_loop, L_check_fill_32_bytes; | |
// setup value 'stamp' in rtmp = <value,value> | |
andl(value, 0xff); // value &= 0xFF; | |
movl(rtmp, value); // rtmp = value << 8 | |
shll(rtmp, 8); | |
orl(value, rtmp); // value |= rtmp; | |
// if (count < 8) goto L_fill_4_bytes; | |
cmpl(count, 8); // Short arrays (< 8 bytes) fill by element | |
jcc(Assembler::below, L_fill_4_bytes); // use unsigned cmp | |
BIND(L_fill_32_bytes); // Fill 64-byte chunks | |
movdl(xtmp, value); | |
vpbroadcastd(xtmp, xtmp); // xtmp is now full of value duplicates, ready to 'stamp' | |
subl(count, 64); // count -= 64 | |
// if (count < 0) goto L_check_fill_32_bytes | |
jcc(Assembler::less, L_check_fill_32_bytes); | |
align(16); | |
BIND(L_fill_64_bytes_loop); | |
vmovdqu(Address(to, 0), xtmp); // 32 bytes write | |
vmovdqu(Address(to, 32), xtmp); // 32 bytes write | |
addptr(to, 64); | |
subl(count, 64); // count -= 64 | |
jcc(Assembler::greaterEqual, L_fill_64_bytes_loop); | |
BIND(L_check_fill_32_bytes); | |
addl(count, 32); // count += 32 | |
// if (count < 0) goto L_check_fill_8_bytes | |
jccb(Assembler::less, L_check_fill_8_bytes); | |
vmovdqu(Address(to, 0), xtmp); // 32 bytes write | |
addptr(to, 32); | |
subl(count, 32); // count -= 32 | |
BIND(L_check_fill_8_bytes); | |
vzeroupper();// clean upper bits of YMM registers | |
addl(count, 32); // count += 32 | |
// if (count == 0) goto L_exit | |
jccb(Assembler::zero, L_exit); | |
// goto L_fill_8_bytes | |
jmpb(L_fill_8_bytes); | |
// length is too short, just fill qwords | |
BIND(L_fill_8_bytes_loop); | |
movq(Address(to, 0), xtmp); // 8 bytes write | |
addptr(to, 8); | |
BIND(L_fill_8_bytes); | |
subl(count, 8); // count -= 8 | |
jcc(Assembler::greaterEqual, L_fill_8_bytes_loop); | |
// fill trailing 4 bytes | |
BIND(L_fill_4_bytes); | |
testl(count, 4); | |
jccb(Assembler::zero, L_fill_2_bytes); | |
movl(Address(to, 0), value); | |
addptr(to, 4); | |
BIND(L_fill_2_bytes); | |
// fill trailing 2 bytes | |
testl(count, 2); | |
jccb(Assembler::zero, L_fill_byte); | |
movw(Address(to, 0), value); | |
addptr(to, 2); | |
BIND(L_fill_byte); | |
// fill trailing byte | |
testl(count, 1); | |
jccb(Assembler::zero, L_exit); | |
movb(Address(to, 0), value); | |
BIND(L_exit); | |
} |
Roughly speaking the algorithm above is:
- Fill up an XMM register with the intended value
- Use the XMM register to write 64 byte chunks (2 vmovdqu) until no more are available
- Write leftover 32 byte chunk (skipped if no matching leftovers)
- Write leftover 8 byte chunks (skipped if no matching leftovers)
- Write leftover 4 bytes (skipped if no matching leftovers)
- Write leftover 2 bytes (skipped if no matching leftovers)
- Write leftover 1 bytes (skipped if no matching leftovers)
The Arrays.fill 'intrinsic'
Arrays.fill is different from System.arrayCopy because, as it's absence from vmSymbols suggests, it's not a method substitution kind of intrinsic (so technically not an intrinsic). What is it then? Arrays.fill is a code pattern substitution kind of compiler shortcut, basically looking for this kind of loop:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
for (int i = 0; i < arrayB.length; i++) { | |
arrayB[i] = b; | |
} |
Pattern matching makes this optimization potentially far more powerful than method substitution as the programmer doesn't have to use a special JDK method to convey meaning, they can just express their meaning in code and the compiler 'knows' that this simple loop means 'fill'. Compare that with System.arrayCopy where rolling your own leads to performance that is much worse than that offered by the intrinsic.
Let's prove me right (my favourite thing, beats kittens and all that crap), here's a JMH (see the JMH reference page for more JMH info/examples) benchmark comparing Arrays.fill to a hand rolled fill, and System.arrayCopy to handrolled array copy:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
@Benchmark | |
public void fillBytes() { | |
Arrays.fill(arrayB, b); | |
} | |
@Benchmark | |
public void manualFillBytes() { | |
fill(arrayB, b); | |
} | |
public static void fill(byte[] a, byte val) { | |
// Copied from Arrays.fill | |
for (int i = 0, len = a.length; i < len; i++) | |
a[i] = val; | |
} | |
@Benchmark | |
public void manualReversedFillBytes() { | |
rfill(arrayB, b); | |
} | |
public static void rfill(byte[] a, byte val) { | |
for (int i = a.length - 1; i >= 0; i--) | |
a[i] = val; | |
} | |
@Benchmark | |
public void copyBytes() { | |
System.arraycopy(arrayB, 0, arrayB2, 0, arrayB.length); | |
} | |
@Benchmark | |
public void manualCopyBytes() { | |
for (int i = 0; i < arrayB.length; i++) { | |
arrayB[i] = arrayB2[i]; | |
} | |
} |
And the results are (Oracle JDK8u40/i7-4770@3.40GHz/Ubuntu, array is 32K in size)?
ArrayFill.fillBytes 561.540 ± 10.814 ns/op
ArrayFill.manualFillBytes 557.901 ± 5.255 ns/op
ArrayFill.manualReversedFillBytes 1017.856 ± 0.425 ns/op
ArrayFill.copyBytes 1300.313 ± 13.482 ns/op
ArrayFill.manualCopyBytes 1477.442 ± 13.030 ns/op
We can verify that the call out to the JVM fill method happens for fillBytes/manualFillBytes by printing out the assembly:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
; fillBytes: | |
0x000000010d1e49bc: mov rdi,QWORD PTR [rsi+0x48] ;*getfield arrayB, ArrayFill::fillBytes@(line 72) | |
0x000000010d1e49c0: mov r10d,DWORD PTR [rdi+0x10];*arraylength, Arrays::fill@(line 2986) <- ArrayFill::fillBytes@(line 72) | |
; implicit exception: dispatches to 0x000000010d1e4a1d | |
0x000000010d1e49c4: test r10d,r10d | |
0x000000010d1e49c7: jle 0x000000010d1e49f5 ;*if_icmpge, Arrays::fill@(line 2986)<- ArrayFill::fillBytes@(line 72) | |
0x000000010d1e49c9: movsx r11d,BYTE PTR [rsi+0x44] ;*getfield b, ArrayFill::fillBytes@(line 72) | |
0x000000010d1e49ce: test r10d,r10d | |
0x000000010d1e49d1: jbe 0x000000010d1e4a01 | |
0x000000010d1e49d3: mov r8d,r10d | |
0x000000010d1e49d6: dec r8d | |
0x000000010d1e49d9: cmp r8d,r10d | |
0x000000010d1e49dc: jae 0x000000010d1e4a01 ; Arrays::fill@(line 2987)<-ArrayFill::fillBytes@(line 72) | |
0x000000010d1e49de: add rdi,0x18 | |
0x000000010d1e49e2: movsxd rdx,r10d | |
0x000000010d1e49e5: mov esi,r11d | |
0x000000010d1e49e8: movabs r10,0x10d086be0 | |
; *********** Calling fill method stub *********** | |
0x000000010d1e49f2: call r10 ;*synchronization entry, ArrayFill::fillBytes@-1(line 72) | |
0x000000010d1e49f5: add rsp,0x20 | |
0x000000010d1e49f9: pop rbp | |
0x000000010d1e49fa: test DWORD PTR [rip+0xffffffffff604600],eax; {poll_return} | |
0x000000010d1e4a00: ret | |
; manualFillBytes: | |
; ... similar shit goes on | |
0x000000010d70861e: add rdi,0x18 | |
0x000000010d708622: movsxd rdx,r10d | |
0x000000010d708625: mov esi,r11d | |
0x000000010d708628: movabs r10,0x10d5a5be0 | |
; *********** Calling fill method stub *********** | |
0x000000010d708632: call r10 ;*synchronization entry, ArrayFill::manualFillBytes@-1(line 77) | |
0x000000010d708635: add rsp,0x20 | |
0x000000010d708639: pop rbp | |
0x000000010d70863a: test DWORD PTR [rip+0xfffffffffe6c59c0],eax # 0x000000010bdce000 | |
; {poll_return} | |
0x000000010d708640: ret | |
; manualReveredFillBytes | |
; ... similar loading of fields and shit, BUT THEN: | |
0x000000010bed421e: cmp r10d,0xe | |
0x000000010bed4222: jle 0x000000010bed425b | |
0x000000010bed4224: vmovd xmm0,ebx | |
0x000000010bed4228: vpunpcklbw xmm0,xmm0,xmm0 | |
0x000000010bed422c: vpshuflw xmm0,xmm0,0x0 | |
0x000000010bed4231: vpunpcklqdq xmm0,xmm0,xmm0 | |
0x000000010bed4235: data32 data32 nop WORD PTR [rax+rax*1+0x0] ;ArrayFill::rfill@(line 91)<- ArrayFill::manualReversedFillBytes@(line 87) | |
0x000000010bed4240: mov r8d,r10d | |
0x000000010bed4243: add r8d,0xfffffffffffffff2 | |
0x000000010bed4247: movsxd r8,r8d | |
0x000000010bed424a: vmovdqu XMMWORD PTR [r11+r8*1+0x17],xmm0 ;ArrayFill::rfill@(line 91)<-ArrayFill::manualReversedFillBytes@(line 87) | |
0x000000010bed4251: add r10d,0xfffffffffffffff0 ;ArrayFill::rfill@13 (line 90)<-ArrayFill::manualReversedFillBytes@(line 87) | |
0x000000010bed4255: cmp r10d,0xe | |
0x000000010bed4259: jg 0x000000010bed4240 ; ArrayFill::rfill@6 (line 90)<-ArrayFill::manualReversedFillBytes@(line 87) | |
0x000000010bed425b: cmp r10d,0xffffffffffffffff | |
0x000000010bed425f: jle 0x000000010bed4275 | |
0x000000010bed4261: data32 xchg ax,ax ;ArrayFill::rfill@(line 91)<-ArrayFill::manualReversedFillBytes@(line 87) | |
0x000000010bed4264: movsxd r8,r10d | |
0x000000010bed4267: mov BYTE PTR [r11+r8*1+0x18],bl ;ArrayFill::rfill@(line 91)<-ArrayFill::manualReversedFillBytes@8(line 87) | |
0x000000010bed426c: dec r10d ;ArrayFill::rfill@(line 90)<-ArrayFill::manualReversedFillBytes@(line 87) | |
0x000000010bed426f: cmp r10d,0xffffffffffffffff | |
0x000000010bed4273: jg 0x000000010bed4264 ;ArrayFill::manualReversedFillBytes@(line 87) | |
0x000000010bed4275: add rsp,0x20 | |
0x000000010bed4279: pop rbp | |
0x000000010bed427a: test DWORD PTR [rip+0xfffffffffe68bd80],eax ; {poll_return} | |
0x000000010bed4280: ret |
- Use System.arrayCopy, it is better than your handrolled loop. But surprisingly not hugely better, hmmm.
- You don't have to use Arrays.fill, you can roll your own and it works the same. Notice the call out to the fill method. But...
- Don't get too creative rolling your own. If you get too funky (like filling the array backwards) it'll fall apart and the 'intrinsic' won't happen. But do note that the reverse fill still has some of that good SIMD stuff going, we'll get to that in a sec.
Are The Other Types Filling The Love?
It all sounds great don't it? Let's see how this pans out for other types. We'll be filling an array of 32KB. To be uniform across data types that means a 16K chars/shorts array, an 8K ints/floats array and a 4K array of longs. I added an 8K array of objects, which is the same size for compressed oops on the Oracle JVM (reference size is 4 bytes, same as an int).The JMH benchmark code is as you'd expect:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
@Setup | |
public void setup() { | |
length = 1 << pow2size; | |
arrayB = new byte[length]; | |
arrayC = new char[length/2]; | |
arrayS = new short[length/2]; | |
arrayI = new int[length/4]; | |
arrayF = new float[length/4]; | |
arrayL = new long[length/8]; | |
arrayD = new double[length/8]; | |
arrayO = new Object[length/4]; | |
} | |
@Benchmark | |
public void fillBytes() { | |
Arrays.fill(arrayB, b); | |
} | |
@Benchmark | |
public void fillChars() { | |
Arrays.fill(arrayC, c); | |
} | |
@Benchmark | |
public void fillShorts() { | |
Arrays.fill(arrayS, s); | |
} | |
@Benchmark | |
public void fillInts() { | |
Arrays.fill(arrayI, i); | |
} | |
@Benchmark | |
public void fillFloats() { | |
Arrays.fill(arrayF, f); | |
} | |
@Benchmark | |
public void fillLongs() { | |
Arrays.fill(arrayL, l); | |
} | |
@Benchmark | |
public void fillDoubles() { | |
Arrays.fill(arrayD, d); | |
} | |
@Benchmark | |
public void fillObjects() { | |
Arrays.fill(arrayO, o); | |
} |
- If no optimizations are present, wider writes are more efficient. It follows that the longFill would be the fastest. But...
- Given a clever compiler the fill loop is replaced with the widest writes possible, so there should be no significant difference. But the fill optimization does not cover double/long/object arrays, so we might expect longFill to be the worst performer.
- An objects array is not that different from an int array, so performance should be similar. Sure there's a write barrier, but it need only be done once per card (not once for the whole array as I thought initially, god bless Shipilev and PrintAssembly), so that's an extra byte write per card of elements filled. A card is per 512 bytes, each element is 4 bytes, so that's one card per 128 elements. Given there is no fill method implemented for it we may expect it to be slightly worse than the longFill.
- We should not rely on expectations, because performance is best measured.
ArrayFill.fillBytes 561.540 ± 10.814 ns/op
ArrayFill.fillChars 541.901 ± 4.833 ns/op
ArrayFill.fillShorts 532.936 ± 4.508 ns/op
ArrayFill.fillInts 543.165 ± 3.788 ns/op
ArrayFill.fillFloats 537.599 ± 2.323 ns/op
ArrayFill.fillLongs 326.770 ± 3.267 ns/op
ArrayFill.fillDoubles 346.840 ± 5.786 ns/op
ArrayFill.fillObjects 4388.242 ± 11.945 ns/op
Say WOT?
For bytes/chars/shorts/ints/floats Arrays.fill performs very similarly. This much is as expected from the second point above. But filling an array of longs/doubles is better than the others. The funny thing is, there's no fill function implemented for the long array, how come it is so darn quick? Also, why does the objects fill suck quite so badly when compared with the rest (I will not be addressing this last question! I refuse! this post is too fucking long as it is!)?
This is what happens when we turn off the OptimizeFill flag:
ArrayFill.fillBytes 1013.803 ± 0.227 ns/op
ArrayFill.fillChars 323.806 ± 3.879 ns/op
ArrayFill.fillShorts 323.689 ± 4.499 ns/op
ArrayFill.fillInts 326.336 ± 1.559 ns/op
ArrayFill.fillFloats 319.449 ± 2.048 ns/op
ArrayFill.fillLongs 328.692 ± 3.282 ns/op
ArrayFill.fillDoubles 345.035 ± 6.362 ns/op
ArrayFill.fillObjects 4397.130 ± 7.161 ns/op
ArrayFill.fillChars 323.806 ± 3.879 ns/op
ArrayFill.fillShorts 323.689 ± 4.499 ns/op
ArrayFill.fillInts 326.336 ± 1.559 ns/op
ArrayFill.fillFloats 319.449 ± 2.048 ns/op
ArrayFill.fillLongs 328.692 ± 3.282 ns/op
ArrayFill.fillDoubles 345.035 ± 6.362 ns/op
ArrayFill.fillObjects 4397.130 ± 7.161 ns/op
Strange innit? now we got char/int/long arrays all performing similarly. In fact, with the exception of the byte array, everything is better than it was with the optimization.
ArrayFill.fillBytes 8501.270 ± 2.896 ns/op
ArrayFill.fillChars 4286.161 ± 4.935 ns/op
ArrayFill.fillShorts 4286.168 ± 3.146 ns/op
ArrayFill.fillInts 2152.489 ± 2.653 ns/op
ArrayFill.fillFloats 2140.649 ± 2.587 ns/op
ArrayFill.fillLongs 1105.926 ± 2.228 ns/op
ArrayFill.fillDoubles 1105.820 ± 2.393 ns/op
ArrayFill.fillObjects 4392.506 ± 11.678 ns/op
Life is revealed in all it's sucky splendour! This is what happens when the compiler shows you no love... did I say no love? hang on, things can get a bit worse.
The results are:
ArrayFill.unsafeFillOffheapBytes 9742.621 ± 2.270 ns/op
ArrayFill.unsafeFillOnHeapBytes 12640.019 ± 1.977 ns/op
ArrayFill.fillBytes(for reference) 561.540 ± 10.814 ns/op
The Unsafe variant do not enjoy the 'fill' pattern matching magic, nor do they get the SuperWord optimizations. What can you do? For this kind of thing you should use the Unsafe.setMemory method instead:
With the result:
ArrayFill.unsafeSetOffheapBytes 1259.281 ± 21.294 ns/op
ArrayFill.unsafeSetOnHeapBytes 1275.158 ± 27.950 ns/op
Not quite there, still ~2x worse (why? how come it doesn't just call the bytes fill method? a bit of digging shows it ends up calling the underlying platform's memset...) but beats being 20-25x worse like the handrolled method is.
[UPDATE 14/03/2015: It seems the good folks of Oracle have considered and rejected the use of REP MOVSB for array copy.]
Thanks go to the kind reviewers Peter 'Massive' Hughes, Darrach and the Shipster
Superword to the rescue!
Turns out the JIT compiler is clued up on the topic of SIMD parallelisation by way of Superword Level Parallelism (see the original paper here):In some respects, superword level parallelism is a restricted form of ILP (Instruction Level Parallelism). ILP techniques have been very successful in the general purpose computing arena, partly because of their ability to find parallelism within basic blocks. In the same way that loop unrolling translates loop level parallelism into ILP, vector parallelism can be transformed into SLP. This realization allows for the parallelization of vectorizable loops using the same basic block analysis. As a result, our algorithm does not require any of the complicated loop transformations typically associated with vectorization. In fact, vector parallelism alone can be uncovered using a simplified version of the SLP compiler algorithm.The Hotspot compiler implements SLP optimizations in superword.cpp and you are invited to dive into the implementation if you like. I'm going to focus on it's impact here, and to do that I only need to know how to turn it on and off (core competency for any software person). It's on by default, so above results are what happens when it is on, here's what life looks like when it is off too (so -XX:-OptimizeFill -XX:-UseSuperWord):
...
Superword level parallelism is defined as short SIMD parallelism in which the source and result operands of a SIMD operation are packed in a storage location.
...
Vector parallelism is a subset of superword level parallelism.
ArrayFill.fillBytes 8501.270 ± 2.896 ns/op
ArrayFill.fillChars 4286.161 ± 4.935 ns/op
ArrayFill.fillShorts 4286.168 ± 3.146 ns/op
ArrayFill.fillInts 2152.489 ± 2.653 ns/op
ArrayFill.fillFloats 2140.649 ± 2.587 ns/op
ArrayFill.fillLongs 1105.926 ± 2.228 ns/op
ArrayFill.fillDoubles 1105.820 ± 2.393 ns/op
ArrayFill.fillObjects 4392.506 ± 11.678 ns/op
Life is revealed in all it's sucky splendour! This is what happens when the compiler shows you no love... did I say no love? hang on, things can get a bit worse.
Detour: Unsafe? We don't serve you kind here
To all the Unsafe fans, I got some sad news for y'all. Unsafe 'fill' loops are not well loved by the compiler. This is the price of stepping off the beaten path I guess. Consider the following benchmark:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
@Benchmark | |
public void unsafeFillOffheapBytes() { | |
for (int i = 0; i < length; i++) { | |
UNSAFE.putByte(address + i, b); | |
} | |
} | |
@Benchmark | |
public void unsafeFillOnHeapBytes() { | |
for (int i = 0; i < arrayB.length; i++) { | |
UNSAFE.putByte(arrayB, B_ARRAY_OFFSET + i, b); | |
} | |
} |
The results are:
ArrayFill.unsafeFillOffheapBytes 9742.621 ± 2.270 ns/op
ArrayFill.unsafeFillOnHeapBytes 12640.019 ± 1.977 ns/op
ArrayFill.fillBytes(for reference) 561.540 ± 10.814 ns/op
The Unsafe variant do not enjoy the 'fill' pattern matching magic, nor do they get the SuperWord optimizations. What can you do? For this kind of thing you should use the Unsafe.setMemory method instead:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
@Benchmark | |
public void unsafeSetOnHeapBytes() { | |
UNSAFE.setMemory(arrayB, B_ARRAY_OFFSET, length, b); | |
} | |
@Benchmark | |
public void unsafeSetOffheapBytes() { | |
UNSAFE.setMemory(address, length, b); | |
} |
ArrayFill.unsafeSetOffheapBytes 1259.281 ± 21.294 ns/op
ArrayFill.unsafeSetOnHeapBytes 1275.158 ± 27.950 ns/op
Not quite there, still ~2x worse (why? how come it doesn't just call the bytes fill method? a bit of digging shows it ends up calling the underlying platform's memset...) but beats being 20-25x worse like the handrolled method is.
Summary and Musings
![]() |
It's the circle of life! |
So what did we learn:
- There's another kind of 'intrinsic' like optimization, which uses pattern matching to swap a block of code rather than a method. This is employed for memset like memory fill loops (in particular Arrays.fill) intrinsicfication. It's not an intrinsic technically, but you know what I fucking mean.
- System.arrayCopy/Arrays.fill implementations utilize SIMD instructions to improve their efficiency. These instructions are not available in plain Java, so some compiler intervention is required.
- The JIT compiler is also able to use SuperWord Level Parallelism to derive SIMD code from 'normal' sequential code.
- In the case of Arrays.fill, it looks like the SuperWord optimized code is faster than the fill specialized implementation for all types except bytes (on the system under test)
- If you use Unsafe you will be excluded from these optimizations.
We want to use SIMD instructions, but the JIT compiler isn't really clever enough to generate them by itself. Memset implementations are rather specialized after all. Let's make life a bit easier for the compiler by creating an intrinsic. We'll even go the extra mile and make an effort to automatically identify opportunities to use this intrinsic, so now it's not really an intrinsic any more. The Arrays.fill optimization is available on Oracle JDK6u45 (the oldest I keep around, maybe it was there a while before that) and on that JVM it is twice as fast as the SLP generated code.
Over time, SLP gets better and eventually the compiler is now good enough to optimize the fill loop by itself and beat the specialized method. That is an awesome thing. We just need to remove the training wheels now.
And there's a final punch line to this story. Memset/Memcpy are such common and important opportunities for optimization, so Intel has decided to offer an assembly 'recipe' for them and save everyone the effort in writing them:
3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB)From the manual it seems that this method of implementing memcpy/memset can perform well, but like anything else, YMMV (the intel manual discussion of the performance differences is in itself interesting both on the results and the methodology level). One obvious advantage of this method is that it results in much much smaller code that should be trivial to inline into callers. This will however put the SuperWord method at a slight disadvantage, and the tide will change again.
Beginning with processors based on Intel microarchitecture code name Ivy Bridge, REP string operation using MOVSB and STOSB can provide both flexible and high-performance REP string operations for soft- ware in common situations like memory copy and set operations. Processors that provide enhanced MOVSB/STOSB operations are enumerated by the CPUID feature flag: CPUID:(EAX=7H, ECX=0H):EBX.ERMSB[bit 9] = 1. - [From the Intel Optimization Manual(September 2014)]
[UPDATE 14/03/2015: It seems the good folks of Oracle have considered and rejected the use of REP MOVSB for array copy.]
Thanks go to the kind reviewers Peter 'Massive' Hughes, Darrach and the Shipster
Nitsan
ReplyDeleteThanks for the wonderful post... Very timely (I've been looking into SIMD compiler optimizations after I saw Mike Barkers video where he compared C and Java implementations of SimpleBinaryEncoding
http://www.infoq.com/presentations/performance-safety
(where @ about 35:15 he noted that the Java JIT compiler (by default) uses SIMD under the covers, whereas C requires you to "opt in" with compiler flags)
"Java is fast by default"
Cheers
Eric
The JVM is prety cool on that score, yes.
DeleteI'm not sure how well GCC handles vectorization compared to C2, that would make an interesting topic. The thing that bothers people with Java and SIMD is that the compiler can't always figure out what they want to do. In C they could use SIMD intrinsics or assembly, in Java they are stuck.
Great post (again) ! I made benchmarks regarding arrays.fillObject some 2-4 years ago and found the fastest way to null an object array was to system.arraycopy from a static empty object array. Do you know since which release the "pattern driven" optimization is present in JDK ?
ReplyDelete(from fast-serialization):
public static void clear(Object[] arr, int arrlen) {
int count = 0;
final int length = EmptyObjArray.length;
while (arrlen - count > length) {
System.arraycopy(EmptyObjArray, 0, arr, count, length);
count += length;
}
System.arraycopy(EmptyObjArray, 0, arr, count, arrlen - count);
}
The pattern matching replacing thingy is in the JVM since 1.6, but it doesn't cover Object[]/long[]/double[] so really is not a help for what you're trying to do.
DeleteIf you want a hacky way of clearing the array using memset you can use Unsafe.setMemory on the array and set the elements to 0 (so something like: UNSAFE.setMemory(arrayB, UNSAFE.arrayBaseOffset(Object[].class), length*UNSAFE.arrayIndexScale(Object[].class), 0);). Since you're nulling it the card marking is irrelevant anyhow (well, unless you use G1GC? or some other GC algo where it matters?). This is a terrible idea so please don't do it.
Thanks for not-advice. Hm .. i also use this technique to clear int arrays, can't remember if I also benchmarked those or just assumed it would also be faster (like with Object arrays). Smells like I can safe zem nanos :-)
DeleteFollow up a quick and dirty test measuring performance of clearing int[] and Object[] arrays:
ReplyDeleteclear int array using Arrays.fill : 500ms
clear int array using System.arraycopy of static empty array : 700ms
clear Object array using Arrays.fill : 3280ms (!!)
clear int array using System.arraycopy of static empty array : 700ms
so indeed the trickery does only payoff for object arrays.
(bench source https://github.com/RuedigerMoeller/fast-serialization/blob/master/src/main/java/org/nustaq/serialization/util/FSTUtil.java)
ARGG ! fell into autoboxing trap by Arrays.fill( Object[], 0 ).
DeleteCorrect number for Object[] clear with Arrays.fill is 1390 !
why, oh why would you not use JMH?
Deletetime. just a five minute thingy inline .. results are stable and give a sufficient impression.
DeleteRight tool for the job, takes a second to knock up and you get all the tooling for free. Once you get in the habit you'll thank me.
DeleteGreat reading.
ReplyDeleteDo you know if similar SIMD optimizations are implemented for summing arrays?
for(int i = 0; i<LENGTH; i++) { result[i] = a[i] + b[i]; }
In an ideal world (with AVX2 support) you can have the above transformed into a non-destructive 3 register vectorized instruction. The JIT compiler won't do that for you, but I think it would vectorize it into a wide copy and increment in place.
DeleteThe thing to do is write some JMH code around it and look at the generated assembly with perfasm.
Let me know how it went :-)