Psychosomatic, Lobotomy, Saw: April 2015

Wednesday, 22 April 2015

Porting Pitfalls: Turning D.Vyukov MPSC Wait-free queue into a j.u.Queue

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
D. Vyukov is an awesome lock-free dude, and I often refer to his instructive and invaluable website 1024cores.net in my posts. On his site he covers lock free queue implementations and in particular a wait-free MPSC linked node queue. This is really rather special when you consider that normally MP would imply lock-free rather than wait-free guarantees. I've ported his algorithm to Java (and so have many others: Netty/Akka/RxJava etc.), and had to tweak it to match the Queue interface. In this post I'd like to explain the algorithm, it's translation to Java, and the implications of making it a j.u.Queue.

Lock free vs. Wait free

Let's review the definitions:

Wait-free: thread progress is guaranteed, all operations finish in a determined number of steps.
Lock-free: global progress is guaranteed, though a particular thread may be blocked.

An example of a transition from lock free to wait free is available with JDK8 changes to AtomicReference::getAndSet(). The change was made by utilizing the newly available Unsafe::getAndSetObject intrinsic which translates directly to XCHG (on x86). So where we used to have for AtomicReference:
T getAndSet(T newVal) {
T currentValue;
do {
currentValue = val; // val is a volatile field
} while (!Unsafe.compareAndSwapObject(this, VAL_FIELD_OFFSET, currentValue, newValue));
return currentValue;
}
Now we have:
T getAndSet(T newVal) {
return Unsafe.getAndSetObject(this, VAL_FIELD_OFFSET, newValue);
}
I discussed a similar change to AtomicLong.getAndAdd in a previous post, replacing the CAS loop with LOCK XADD.

The Vyukov Wait Free MPSC queue

This is a LinkedList type structure and the interesting methods are offer and poll, here's the original (I did some formatting):

	struct mpscq_node_t {
	mpscq_node_t* volatile next;
	void* state;
	};

	struct mpscq_t {
	mpscq_node_t* volatile head;
	mpscq_node_t* tail;
	};

	void mpscq_create(mpscq_t* self, mpscq_node_t* stub) {
	stub->next = 0;
	self->head = stub;
	self->tail = stub;
	}
	void mpscq_offer(mpscq_t* self, mpscq_node_t* n) {
	n->next = NULL;
	mpscq_node_t* prev = XCHG(&self->head, n); // serialization-point wrt producers, acquire-release
	prev->next = n; // serialization-point wrt consumer, release
	}

	mpscq_node_t* mpscq_poll(mpscq_t* self) {
	mpscq_node_t* tail = self->tail;
	mpscq_node_t* next = tail->next; // serialization-point wrt producers, acquire
	if (next != NULL) {
	self->tail = next;
	tail->state = next->state;
	return tail;
	}
	return NULL;
	}

view raw gistfile1.c hosted with ❤ by GitHub

Awesome in it's simplicity, deceptive in it's genius. Be aware that head/tail meaning is the other way around than what most people are used to. I personally go for producer/consumerNode (head = producer side, tail = consumer side in the above snippet) in my code, but for consistency I'll stick with Mr. Vs notation for the porting exercise.
But how do we manage the same memory barriers in Java? We can be nice about it and use the AtomicFieldUpdater or more brutal and use Unsafe. I find you get better performance with Unsafe, but you should consider how appropriate this is for you. In any case, here's what we end up with:

	class MpscQueue<T> {
	// This song and dance is required to get field offset which we need for getAndSet
	private static final long HEAD_FIELD_OFFSET;
	static {
	try {
	HEAD_FIELD_OFFSET = UNSAFE.objectFieldOffset(MpscQueue.class.getDeclaredField("head"));
	} catch (NoSuchFieldException e) {
	throw new RuntimeException(e);
	}
	}
	static class Node<T> {
	T val;
	// C 'volatile' is not the same as Java 'volatile', but in this case they happen to
	// apply to the same fields.
	volatile Node<T> next;
	}

	volatile Node<T> head;
	Node<T> tail;

	MpscQueue() {
	head = tail = new Node<T>();
	}

	public boolean offer(T val) {
	if (val == null) throw new IllegalArgumentException("You bastard!!!");
	Node<T> n = new Node<T>();
	n.val = val;
	// getAndSet acts as an effective StoreLoad[0]
	// see https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/atomic/package-summary.html
	// The LOCK XCHG mechanics create the ordering between producers, hence Vyukov comment on
	// serialization.
	Node<T> prev = (Node<T>) UNSAFE.getAndSetObject(this, HEAD_FIELD_OFFSET, n);
	// StoreLoad[1] (StoreStore would do we can weaken this write to a lazySet/putOrdered)
	prev.next = n;
	return true;
	}

	public T poll() {
	// LoadLoad matching StoreLoad[1]
	Node<T> next = tail.next;
	// this establishes the HB relationship between producer and consumer
	if (next != null) {
	T val = next.val;
	next.val = null;
	tail = next;
	return val;
	}
	return null;
	}
	}

view raw gistfile1.java hosted with ❤ by GitHub

The code is pretty similar. Now if we wanted to complete the j.u.Queue we could extend AbstractQueue and implement only size()/peek() and we're done (I'm not going to bother with the iterator()):

	// to save us some effort we extend AbstractQueue, this requires we implement
	// 3 more methods.
	class MpscQueue<T> extends AbstractQueue<T> {
	@Override
	public T peek() {
	Node<T> next = tail.next;
	if (next != null) {
	return next.val;
	}
	return null;
	}

	@Override
	public int size() {
	int size = 0;
	Node<T> chaser = tail;
	while ((chaser = chaser.next) != null) {
	size++;
	}
	return size;
	}

	@Override
	public Iterator<T> iterator() {
	return null; // fuck it...
	}
	}

view raw gistfile1.java hosted with ❤ by GitHub

There we go, seems reasonable don't it? off to the pub to celebrate some kick ass lock free programming!

Vyukov::poll() ain't Queue::poll!

I've discussed this particular annoyance in a previous post which has some overlap with this one. The names are the same, but as it turns out the guarantees are quite different. While Vyukov is doing a fine job implementing a queue, not any queue is a j.u.Queue. In particular for this case, the poll() method has a different contract:

j.u.Queue: Retrieves and removes the head of this queue, or returns null if this queue is empty.
Vyukov: Retrieves and removes the head of this queue, or returns null if next element is not available.

Why wouldn't the next element be available? Doesn't that mean the queue is empty? Sadly this ain't the case. Imagine for instance there are 2 producers and we run both threads step by step in a debugger. We break at line 33 (4th line of offer, the Java version):

Producer 1: we step 1 line, we just replaced head with node n. We are suspended before executing line 35.
Producer 2: we let the program continue. We've replaced head and linked it to the previous head.

What state are we in? Let's look at head/tail nodes and where they lead (I number the nodes in order of assignment to head):

head = Node[2], this is the node created by Producer 2. We also know that Node[1].next = Node[2], because we let offer run it's course on Producer 2.
tail = Node[0] the node we allocated in the constructor. This is the node head was before the first producer came along. This is what prev is equal to for Producer 1, but because we suspended that thread it never set it's next value. Node[0].next is still null!

If a consumer came along now they would get a null from poll(), indicating the queue is empty. But the queue is obviously not empty!
So it seems we cannot deduce the queue is empty from looking at tail.next here's 2 valid indicators that the queue is empty:

head == tail : this is the starting point set in the constructor and where the consumer ends up after consuming the last element
head.val == null : head can only have a value of null if it is tail

Here's a solution to a correct poll() and the knock on effect on peek():

	@Override
	public boolean isEmpty() {
	return head == tail;
	}

	@Override
	public T poll() {
	Node<T> t = tail; // avoid reloading after volatile load
	Node<T> next = t.next; // volatile load
	if (next != null) {
	return consumeNode(next);
	}
	else if (t != head) { // same as isEmpty, but no reason to reload tail
	// spin wait for next to show up
	while ((next = t.next) == null);
	return consumeNode(next);
	}
	return null;
	}

	@Override
	public T peek() {
	Node<T> t = tail;
	Node<T> next = t.next;
	if (next != null) {
	return next.val;
	}
	else if (t != head) {
	// spin wait for next to show up
	while ((next = t.next) == null);
	return next.val;
	}
	return null;
	}

	private T consumeNode(Node<T> next) {
	T val = next.val;
	next.val = null;
	tail = next;
	return val;
	}

view raw gistfile1.java hosted with ❤ by GitHub

This is a bit annoying because now we have a wait-free offer(), but poll() and peek() are lock-free(only block one thread, producers can make progress).
This pitfall is tricky enough that not only did I fall for it (on another queue algorithm, but same mistake), it also took by surprise Mr. Manes writing an interesting variant on this queue for his high performance cache implementation (I filed an issue which he promptly fixed), and struck the notorious Dave Dice when considering Vyukov's MPMC queue (see comments for discussion).

So is this it? Are we done?
Almost... size() is still broken.

A Good Size

It's not really surprising that size is broken given the terminating condition for the size loop was relying on next == null to terminate the count. Size was also broken in 2 other subtle ways:

The interface for size dictates that it returns a positive int. But given that the queue is unbounded it is possible (though very unlikely) for it to have more than 2^31 elements. This would require that the linked list consume over 64GB (16b + 8b + 8b=32b per element, refs are 8b since this requires more than 32GB heap so no compressed oops). Unlikely, but not impossible. This edge condition is handled for ConcurrentLinkedQueue (same as here), and for LinkedBlockingQueue (by bounding it's size to MAX_INT, so it's not really unbounded after all) but not for LinkedList (size is maintained in an int and is blindly incremented).
Size is chasing a moving target and as such can in theory never terminate as producers keep adding nodes. We should limit the scope of the measurement to the size of the queue at the time of the call. This is not done for CLQ and should perhaps be considered.

Here's the end result:

	@Override
	public int size() {
	Node<T> chaserNode = tail;
	final Node<T> currentHead = head;

	int size = 0;
	while (chaserNode != currentHead && size < Integer.MAX_VALUE) {
	Node<T> next;
	while((next = chaserNode.next) == null);
	chaserNode = next;
	size++;
	}
	return size;
	}

view raw gistfile1.java hosted with ❤ by GitHub

If you need a full SPSC/MPSC linked queue implementation they are available in JCTools for your pleasure, enjoy!

Special thanks to Doug and Ben for reviewing!

Monday, 13 April 2015

On Arrays.fill, Intrinsics, SuperWord and SIMD instructions

{This post turned rather long, if you get lazy feel free to skip to the summary}
Let's start at the very beginning, a very good place to start... My very first post on this blog was a short rant on intrinsics, and how they ain't what they seem. In that post I made the following statement:

"intrinsic functions show up as normal methods or native methods"

Which is correct. An intrinsic function is applied as a method substitution. A method call will appear in the code and the compiler will replace it's apparent source-level implementation with a pre-cooked implementation. In some cases intrinsics are sort of compilation cheats, the idea being that some bits of functionality are both very important (i.e. worth while optimizing) and can benefit from a hand crafted solution that will be better than what the compiler can achieve. The end result can be in one of a few flavours:

Method call replaced with a call to a JVM runtime method: E.g. System.arrayCopy is replaced with a call to a method stub generated by the runtime for all array types. This method call is not a JNI call, but it is a static method call that is not inlined.
Method call replaced with one or more instructions inlined: E.g. Unsafe.getByte/compareAndSet/Math.max
Method call replaced with compiler IR implementation: E.g. java.lang.reflect.Array.getLength
A mix of the above: E.g. String.equals is partially implemented in IR, but the array comparison is a call to a method stub.

The intrinsics are all set up in vmSymbols.hpp and if you look, you'll see Arrays.fill is NOT on the list. So why am I talking about Chewbacca? Because it is something like an intrinsic...

The Arrays.fill SIMD Opportunity

Arrays.fill is the Java memset (fills an array with a given value), and just like System.arrayCopy (memcpy in C lingo) is worth the effort to optimize and offers the same kind of opportunity. What opportunity might that be, you ask? the opportunity to use SIMD (Single Instruction Multiple Data) instructions when the underlying CPU offers them (I assume for the sake of discussion AVX enabled CPUs i.e. since Sandy Bridge, I find this listing of intel intrinsics useful to explain and sort through the available instructions). These instructions allow the CPU to operate on up to 256 bit (512 bit soon) chunks of data, thus transforming 32 byte sized MOV instructions into a single wide MOV instruction (E.g. the intel C instrinsic _mm256_storeu_si256 or the corresponding instruction vmovdqu). SIMD instructions are good for all sorts of operations on vectors of data, or arrays, which is why the process of transforming element by element operations into SIMD instructions is also referred to as vectorization.
The actual assembly stub is generated dependent on CPU and available instruction set. For x86 the code is generated by the macroAssembler_x86.cpp, and the observant digger into the code will find it makes use of the widest memory instructions it can identify the processor is capable of. Wider is better baby! If you are not morbidly curious about what the implementation looks like, skip the next wall of assembly and you'll be back in Java land shortly.
Here's what the assembly boils down to when UseAVX>=2/UseSSE>=2/UseUnalignedLoadStores=true:

	// This is a hacked up version of the original to be found here:
	// http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/61746b5f0ed3/src/cpu/x86/vm/macroAssembler_x86.cpp#l6085
	// This code is offered under the same licence as the original if
	// you are fool enough to cut and paste it.
	void MacroAssembler::generate_fill(BasicType t, bool aligned,
	Register to, Register value, Register count,
	Register rtmp, XMMRegister xtmp) {
	Label L_exit, L_skip_align1, L_skip_align2, L_fill_byte, L_fill_2_bytes
	Label L_fill_4_bytes, L_fill_32_bytes, L_fill_32_bytes_loop;
	Label L_check_fill_8_bytes, L_fill_8_bytes_loop, L_fill_8_bytes;
	Label L_fill_64_bytes_loop, L_check_fill_32_bytes;

	// setup value 'stamp' in rtmp = <value,value>
	andl(value, 0xff); // value &= 0xFF;
	movl(rtmp, value); // rtmp = value << 8
	shll(rtmp, 8);
	orl(value, rtmp); // value \|= rtmp;

	// if (count < 8) goto L_fill_4_bytes;
	cmpl(count, 8); // Short arrays (< 8 bytes) fill by element
	jcc(Assembler::below, L_fill_4_bytes); // use unsigned cmp

	BIND(L_fill_32_bytes); // Fill 64-byte chunks
	movdl(xtmp, value);
	vpbroadcastd(xtmp, xtmp); // xtmp is now full of value duplicates, ready to 'stamp'
	subl(count, 64); // count -= 64

	// if (count < 0) goto L_check_fill_32_bytes
	jcc(Assembler::less, L_check_fill_32_bytes);

	align(16);
	BIND(L_fill_64_bytes_loop);
	vmovdqu(Address(to, 0), xtmp); // 32 bytes write
	vmovdqu(Address(to, 32), xtmp); // 32 bytes write
	addptr(to, 64);
	subl(count, 64); // count -= 64
	jcc(Assembler::greaterEqual, L_fill_64_bytes_loop);

	BIND(L_check_fill_32_bytes);
	addl(count, 32); // count += 32
	// if (count < 0) goto L_check_fill_8_bytes
	jccb(Assembler::less, L_check_fill_8_bytes);
	vmovdqu(Address(to, 0), xtmp); // 32 bytes write
	addptr(to, 32);
	subl(count, 32); // count -= 32

	BIND(L_check_fill_8_bytes);
	vzeroupper();// clean upper bits of YMM registers

	addl(count, 32); // count += 32
	// if (count == 0) goto L_exit
	jccb(Assembler::zero, L_exit);
	// goto L_fill_8_bytes
	jmpb(L_fill_8_bytes);

	// length is too short, just fill qwords
	BIND(L_fill_8_bytes_loop);
	movq(Address(to, 0), xtmp); // 8 bytes write
	addptr(to, 8);

	BIND(L_fill_8_bytes);
	subl(count, 8); // count -= 8
	jcc(Assembler::greaterEqual, L_fill_8_bytes_loop);

	// fill trailing 4 bytes
	BIND(L_fill_4_bytes);
	testl(count, 4);
	jccb(Assembler::zero, L_fill_2_bytes);
	movl(Address(to, 0), value);
	addptr(to, 4);

	BIND(L_fill_2_bytes);
	// fill trailing 2 bytes
	testl(count, 2);
	jccb(Assembler::zero, L_fill_byte);
	movw(Address(to, 0), value);
	addptr(to, 2);

	BIND(L_fill_byte);
	// fill trailing byte
	testl(count, 1);
	jccb(Assembler::zero, L_exit);
	movb(Address(to, 0), value);
	BIND(L_exit);
	}

view raw gistfile1.cpp hosted with ❤ by GitHub

Roughly speaking the algorithm above is:

Fill up an XMM register with the intended value
Use the XMM register to write 64 byte chunks (2 vmovdqu) until no more are available
Write leftover 32 byte chunk (skipped if no matching leftovers)
Write leftover 8 byte chunks (skipped if no matching leftovers)
Write leftover 4 bytes (skipped if no matching leftovers)
Write leftover 2 bytes (skipped if no matching leftovers)
Write leftover 1 bytes (skipped if no matching leftovers)

It ain't nice, but we do what we gotta for performance! There are variations of the above described across the internets as the done thing for a memset implementation, this might seem complex but is pretty standard... anyway, moving right along.

The Arrays.fill 'intrinsic'

Arrays.fill is different from System.arrayCopy because, as it's absence from vmSymbols suggests, it's not a method substitution kind of intrinsic (so technically not an intrinsic). What is it then? Arrays.fill is a code pattern substitution kind of compiler shortcut, basically looking for this kind of loop:

	for (int i = 0; i < arrayB.length; i++) {
	arrayB[i] = b;
	}

view raw gistfile1.java hosted with ❤ by GitHub

And replacing it with a call into the JVM memset implementation (I recently learnt the same thing is done by GCC as well, see code to assembly here). The pattern matching bit is done in loopTransform.cpp. This feels enough like an intrinsic grey area that the method doing the pattern match and replace is called intrinsify_fill.
Pattern matching makes this optimization potentially far more powerful than method substitution as the programmer doesn't have to use a special JDK method to convey meaning, they can just express their meaning in code and the compiler 'knows' that this simple loop means 'fill'. Compare that with System.arrayCopy where rolling your own leads to performance that is much worse than that offered by the intrinsic.
Let's prove me right (my favourite thing, beats kittens and all that crap), here's a JMH (see the JMH reference page for more JMH info/examples) benchmark comparing Arrays.fill to a hand rolled fill, and System.arrayCopy to handrolled array copy:

	@Benchmark
	public void fillBytes() {
	Arrays.fill(arrayB, b);
	}

	@Benchmark
	public void manualFillBytes() {
	fill(arrayB, b);
	}
	public static void fill(byte[] a, byte val) {
	// Copied from Arrays.fill
	for (int i = 0, len = a.length; i < len; i++)
	a[i] = val;
	}

	@Benchmark
	public void manualReversedFillBytes() {
	rfill(arrayB, b);
	}
	public static void rfill(byte[] a, byte val) {
	for (int i = a.length - 1; i >= 0; i--)
	a[i] = val;
	}

	@Benchmark
	public void copyBytes() {
	System.arraycopy(arrayB, 0, arrayB2, 0, arrayB.length);
	}

	@Benchmark
	public void manualCopyBytes() {
	for (int i = 0; i < arrayB.length; i++) {
	arrayB[i] = arrayB2[i];
	}
	}

view raw gistfile1.java hosted with ❤ by GitHub

And the results are (Oracle JDK8u40/i7-4770@3.40GHz/Ubuntu, array is 32K in size)?
ArrayFill.fillBytes 561.540 ± 10.814 ns/op
ArrayFill.manualFillBytes 557.901 ± 5.255 ns/op
ArrayFill.manualReversedFillBytes 1017.856 ± 0.425 ns/op
ArrayFill.copyBytes 1300.313 ± 13.482 ns/op
ArrayFill.manualCopyBytes 1477.442 ± 13.030 ns/op

We can verify that the call out to the JVM fill method happens for fillBytes/manualFillBytes by printing out the assembly:

	; fillBytes:
	0x000000010d1e49bc: mov rdi,QWORD PTR [rsi+0x48] ;*getfield arrayB, ArrayFill::fillBytes@(line 72)
	0x000000010d1e49c0: mov r10d,DWORD PTR [rdi+0x10];*arraylength, Arrays::fill@(line 2986) <- ArrayFill::fillBytes@(line 72)
	; implicit exception: dispatches to 0x000000010d1e4a1d
	0x000000010d1e49c4: test r10d,r10d
	0x000000010d1e49c7: jle 0x000000010d1e49f5 ;*if_icmpge, Arrays::fill@(line 2986)<- ArrayFill::fillBytes@(line 72)
	0x000000010d1e49c9: movsx r11d,BYTE PTR [rsi+0x44] ;*getfield b, ArrayFill::fillBytes@(line 72)
	0x000000010d1e49ce: test r10d,r10d
	0x000000010d1e49d1: jbe 0x000000010d1e4a01
	0x000000010d1e49d3: mov r8d,r10d
	0x000000010d1e49d6: dec r8d
	0x000000010d1e49d9: cmp r8d,r10d
	0x000000010d1e49dc: jae 0x000000010d1e4a01 ; Arrays::fill@(line 2987)<-ArrayFill::fillBytes@(line 72)
	0x000000010d1e49de: add rdi,0x18
	0x000000010d1e49e2: movsxd rdx,r10d
	0x000000010d1e49e5: mov esi,r11d
	0x000000010d1e49e8: movabs r10,0x10d086be0
	; ********* Calling fill method stub *********
	0x000000010d1e49f2: call r10 ;*synchronization entry, ArrayFill::fillBytes@-1(line 72)
	0x000000010d1e49f5: add rsp,0x20
	0x000000010d1e49f9: pop rbp
	0x000000010d1e49fa: test DWORD PTR [rip+0xffffffffff604600],eax; {poll_return}
	0x000000010d1e4a00: ret

	; manualFillBytes:
	; ... similar shit goes on
	0x000000010d70861e: add rdi,0x18
	0x000000010d708622: movsxd rdx,r10d
	0x000000010d708625: mov esi,r11d
	0x000000010d708628: movabs r10,0x10d5a5be0
	; ********* Calling fill method stub *********
	0x000000010d708632: call r10 ;*synchronization entry, ArrayFill::manualFillBytes@-1(line 77)

	0x000000010d708635: add rsp,0x20
	0x000000010d708639: pop rbp
	0x000000010d70863a: test DWORD PTR [rip+0xfffffffffe6c59c0],eax # 0x000000010bdce000
	; {poll_return}
	0x000000010d708640: ret

	; manualReveredFillBytes
	; ... similar loading of fields and shit, BUT THEN:
	0x000000010bed421e: cmp r10d,0xe
	0x000000010bed4222: jle 0x000000010bed425b
	0x000000010bed4224: vmovd xmm0,ebx
	0x000000010bed4228: vpunpcklbw xmm0,xmm0,xmm0
	0x000000010bed422c: vpshuflw xmm0,xmm0,0x0
	0x000000010bed4231: vpunpcklqdq xmm0,xmm0,xmm0
	0x000000010bed4235: data32 data32 nop WORD PTR [rax+rax*1+0x0] ;ArrayFill::rfill@(line 91)<- ArrayFill::manualReversedFillBytes@(line 87)
	0x000000010bed4240: mov r8d,r10d
	0x000000010bed4243: add r8d,0xfffffffffffffff2
	0x000000010bed4247: movsxd r8,r8d
	0x000000010bed424a: vmovdqu XMMWORD PTR [r11+r8*1+0x17],xmm0 ;ArrayFill::rfill@(line 91)<-ArrayFill::manualReversedFillBytes@(line 87)

	0x000000010bed4251: add r10d,0xfffffffffffffff0 ;ArrayFill::rfill@13 (line 90)<-ArrayFill::manualReversedFillBytes@(line 87)

	0x000000010bed4255: cmp r10d,0xe
	0x000000010bed4259: jg 0x000000010bed4240 ; ArrayFill::rfill@6 (line 90)<-ArrayFill::manualReversedFillBytes@(line 87)

	0x000000010bed425b: cmp r10d,0xffffffffffffffff
	0x000000010bed425f: jle 0x000000010bed4275
	0x000000010bed4261: data32 xchg ax,ax ;ArrayFill::rfill@(line 91)<-ArrayFill::manualReversedFillBytes@(line 87)

	0x000000010bed4264: movsxd r8,r10d
	0x000000010bed4267: mov BYTE PTR [r11+r8*1+0x18],bl ;ArrayFill::rfill@(line 91)<-ArrayFill::manualReversedFillBytes@8(line 87)
	0x000000010bed426c: dec r10d ;ArrayFill::rfill@(line 90)<-ArrayFill::manualReversedFillBytes@(line 87)
	0x000000010bed426f: cmp r10d,0xffffffffffffffff
	0x000000010bed4273: jg 0x000000010bed4264 ;ArrayFill::manualReversedFillBytes@(line 87)

	0x000000010bed4275: add rsp,0x20
	0x000000010bed4279: pop rbp
	0x000000010bed427a: test DWORD PTR [rip+0xfffffffffe68bd80],eax ; {poll_return}
	0x000000010bed4280: ret

view raw gistfile1.asm hosted with ❤ by GitHub

So what have we learnt so far:

Use System.arrayCopy, it is better than your handrolled loop. But surprisingly not hugely better, hmmm.
You don't have to use Arrays.fill, you can roll your own and it works the same. Notice the call out to the fill method. But...
Don't get too creative rolling your own. If you get too funky (like filling the array backwards) it'll fall apart and the 'intrinsic' won't happen. But do note that the reverse fill still has some of that good SIMD stuff going, we'll get to that in a sec.

Are The Other Types Filling The Love?

It all sounds great don't it? Let's see how this pans out for other types. We'll be filling an array of 32KB. To be uniform across data types that means a 16K chars/shorts array, an 8K ints/floats array and a 4K array of longs. I added an 8K array of objects, which is the same size for compressed oops on the Oracle JVM (reference size is 4 bytes, same as an int).
The JMH benchmark code is as you'd expect:

	@Setup
	public void setup() {
	length = 1 << pow2size;
	arrayB = new byte[length];
	arrayC = new char[length/2];
	arrayS = new short[length/2];
	arrayI = new int[length/4];
	arrayF = new float[length/4];
	arrayL = new long[length/8];
	arrayD = new double[length/8];
	arrayO = new Object[length/4];
	}

	@Benchmark
	public void fillBytes() {
	Arrays.fill(arrayB, b);
	}
	@Benchmark
	public void fillChars() {
	Arrays.fill(arrayC, c);
	}

	@Benchmark
	public void fillShorts() {
	Arrays.fill(arrayS, s);
	}

	@Benchmark
	public void fillInts() {
	Arrays.fill(arrayI, i);
	}

	@Benchmark
	public void fillFloats() {
	Arrays.fill(arrayF, f);
	}

	@Benchmark
	public void fillLongs() {
	Arrays.fill(arrayL, l);
	}
	@Benchmark
	public void fillDoubles() {
	Arrays.fill(arrayD, d);
	}

	@Benchmark
	public void fillObjects() {
	Arrays.fill(arrayO, o);
	}

view raw gistfile1.java hosted with ❤ by GitHub

Here's some reasonable expectations:

If no optimizations are present, wider writes are more efficient. It follows that the longFill would be the fastest. But...
Given a clever compiler the fill loop is replaced with the widest writes possible, so there should be no significant difference. But the fill optimization does not cover double/long/object arrays, so we might expect longFill to be the worst performer.
An objects array is not that different from an int array, so performance should be similar. Sure there's a write barrier, but it need only be done once per card (not once for the whole array as I thought initially, god bless Shipilev and PrintAssembly), so that's an extra byte write per card of elements filled. A card is per 512 bytes, each element is 4 bytes, so that's one card per 128 elements. Given there is no fill method implemented for it we may expect it to be slightly worse than the longFill.
We should not rely on expectations, because performance is best measured.

As you'd expect the results are somewhat different than the expectations (Oracle JDK8u40/i7-4770@3.40GHz/Ubuntu):

ArrayFill.fillBytes 561.540 ± 10.814 ns/op

ArrayFill.fillChars 541.901 ± 4.833 ns/op

ArrayFill.fillShorts 532.936 ± 4.508 ns/op

ArrayFill.fillInts 543.165 ± 3.788 ns/op

ArrayFill.fillFloats 537.599 ± 2.323 ns/op

ArrayFill.fillLongs 326.770 ± 3.267 ns/op

ArrayFill.fillDoubles 346.840 ± 5.786 ns/op

ArrayFill.fillObjects 4388.242 ± 11.945 ns/op

Say WOT?
For bytes/chars/shorts/ints/floats Arrays.fill performs very similarly. This much is as expected from the second point above. But filling an array of longs/doubles is better than the others. The funny thing is, there's no fill function implemented for the long array, how come it is so darn quick? Also, why does the objects fill suck quite so badly when compared with the rest (I will not be addressing this last question! I refuse! this post is too fucking long as it is!)?
This is what happens when we turn off the OptimizeFill flag:

ArrayFill.fillBytes 1013.803 ± 0.227 ns/op
ArrayFill.fillChars 323.806 ± 3.879 ns/op
ArrayFill.fillShorts 323.689 ± 4.499 ns/op
ArrayFill.fillInts 326.336 ± 1.559 ns/op
ArrayFill.fillFloats 319.449 ± 2.048 ns/op
ArrayFill.fillLongs 328.692 ± 3.282 ns/op
ArrayFill.fillDoubles 345.035 ± 6.362 ns/op
ArrayFill.fillObjects 4397.130 ± 7.161 ns/op

Strange innit? now we got char/int/long arrays all performing similarly. In fact, with the exception of the byte array, everything is better than it was with the optimization.

Superword to the rescue!

Turns out the JIT compiler is clued up on the topic of SIMD parallelisation by way of Superword Level Parallelism (see the original paper here):

In some respects, superword level parallelism is a restricted form of ILP (Instruction Level Parallelism). ILP techniques have been very successful in the general purpose computing arena, partly because of their ability to find parallelism within basic blocks. In the same way that loop unrolling translates loop level parallelism into ILP, vector parallelism can be transformed into SLP. This realization allows for the parallelization of vectorizable loops using the same basic block analysis. As a result, our algorithm does not require any of the complicated loop transformations typically associated with vectorization. In fact, vector parallelism alone can be uncovered using a simplified version of the SLP compiler algorithm.
...
Superword level parallelism is defined as short SIMD parallelism in which the source and result operands of a SIMD operation are packed in a storage location.
...
Vector parallelism is a subset of superword level parallelism.

The Hotspot compiler implements SLP optimizations in superword.cpp and you are invited to dive into the implementation if you like. I'm going to focus on it's impact here, and to do that I only need to know how to turn it on and off (core competency for any software person). It's on by default, so above results are what happens when it is on, here's what life looks like when it is off too (so -XX:-OptimizeFill -XX:-UseSuperWord):
ArrayFill.fillBytes 8501.270 ± 2.896 ns/op
ArrayFill.fillChars 4286.161 ± 4.935 ns/op
ArrayFill.fillShorts 4286.168 ± 3.146 ns/op
ArrayFill.fillInts 2152.489 ± 2.653 ns/op
ArrayFill.fillFloats 2140.649 ± 2.587 ns/op
ArrayFill.fillLongs 1105.926 ± 2.228 ns/op
ArrayFill.fillDoubles 1105.820 ± 2.393 ns/op
ArrayFill.fillObjects 4392.506 ± 11.678 ns/op

Life is revealed in all it's sucky splendour! This is what happens when the compiler shows you no love... did I say no love? hang on, things can get a bit worse.

Detour: Unsafe? We don't serve you kind here

To all the Unsafe fans, I got some sad news for y'all. Unsafe 'fill' loops are not well loved by the compiler. This is the price of stepping off the beaten path I guess. Consider the following benchmark:

	@Benchmark
	public void unsafeFillOffheapBytes() {
	for (int i = 0; i < length; i++) {
	UNSAFE.putByte(address + i, b);
	}
	}
	@Benchmark
	public void unsafeFillOnHeapBytes() {
	for (int i = 0; i < arrayB.length; i++) {
	UNSAFE.putByte(arrayB, B_ARRAY_OFFSET + i, b);
	}
	}

view raw gistfile1.java hosted with ❤ by GitHub

The results are:
ArrayFill.unsafeFillOffheapBytes 9742.621 ± 2.270 ns/op
ArrayFill.unsafeFillOnHeapBytes 12640.019 ± 1.977 ns/op
ArrayFill.fillBytes(for reference) 561.540 ± 10.814 ns/op
The Unsafe variant do not enjoy the 'fill' pattern matching magic, nor do they get the SuperWord optimizations. What can you do? For this kind of thing you should use the Unsafe.setMemory method instead:

	@Benchmark
	public void unsafeSetOnHeapBytes() {
	UNSAFE.setMemory(arrayB, B_ARRAY_OFFSET, length, b);
	}
	@Benchmark
	public void unsafeSetOffheapBytes() {
	UNSAFE.setMemory(address, length, b);
	}

view raw gistfile1.java hosted with ❤ by GitHub

With the result:
ArrayFill.unsafeSetOffheapBytes 1259.281 ± 21.294 ns/op
ArrayFill.unsafeSetOnHeapBytes 1275.158 ± 27.950 ns/op
Not quite there, still ~2x worse (why? how come it doesn't just call the bytes fill method? a bit of digging shows it ends up calling the underlying platform's memset...) but beats being 20-25x worse like the handrolled method is.

Summary and Musings

It's the circle of life!

So what did we learn:

There's another kind of 'intrinsic' like optimization, which uses pattern matching to swap a block of code rather than a method. This is employed for memset like memory fill loops (in particular Arrays.fill) intrinsicfication. It's not an intrinsic technically, but you know what I fucking mean.
System.arrayCopy/Arrays.fill implementations utilize SIMD instructions to improve their efficiency. These instructions are not available in plain Java, so some compiler intervention is required.
The JIT compiler is also able to use SuperWord Level Parallelism to derive SIMD code from 'normal' sequential code.
In the case of Arrays.fill, it looks like the SuperWord optimized code is faster than the fill specialized implementation for all types except bytes (on the system under test)
If you use Unsafe you will be excluded from these optimizations.

So I look at this process and I imagine history went something like this:

We want to use SIMD instructions, but the JIT compiler isn't really clever enough to generate them by itself. Memset implementations are rather specialized after all. Let's make life a bit easier for the compiler by creating an intrinsic. We'll even go the extra mile and make an effort to automatically identify opportunities to use this intrinsic, so now it's not really an intrinsic any more. The Arrays.fill optimization is available on Oracle JDK6u45 (the oldest I keep around, maybe it was there a while before that) and on that JVM it is twice as fast as the SLP generated code.

Over time, SLP gets better and eventually the compiler is now good enough to optimize the fill loop by itself and beat the specialized method. That is an awesome thing. We just need to remove the training wheels now.

And there's a final punch line to this story. Memset/Memcpy are such common and important opportunities for optimization, so Intel has decided to offer an assembly 'recipe' for them and save everyone the effort in writing them:

3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB)
Beginning with processors based on Intel microarchitecture code name Ivy Bridge, REP string operation using MOVSB and STOSB can provide both flexible and high-performance REP string operations for soft- ware in common situations like memory copy and set operations. Processors that provide enhanced MOVSB/STOSB operations are enumerated by the CPUID feature flag: CPUID:(EAX=7H, ECX=0H):EBX.ERMSB[bit 9] = 1. - [From the Intel Optimization Manual(September 2014)]

From the manual it seems that this method of implementing memcpy/memset can perform well, but like anything else, YMMV (the intel manual discussion of the performance differences is in itself interesting both on the results and the methodology level). One obvious advantage of this method is that it results in much much smaller code that should be trivial to inline into callers. This will however put the SuperWord method at a slight disadvantage, and the tide will change again.
[UPDATE 14/03/2015: It seems the good folks of Oracle have considered and rejected the use of REP MOVSB for array copy.]
Thanks go to the kind reviewers Peter 'Massive' Hughes, Darrach and the Shipster