Psychosomatic, Lobotomy, Saw: February 2014

Love running? Love scissors? I know just the thing for you! Following on from a recent discussion on the Mechanical Sympathy mailing list, I see an anti-pattern worth correcting in the way people use Unsafe. I say correcting as I doubt people are going to stop, so they might as well be made aware of the pitfalls. This pattern boils down to a classic concurrency bug:

	class Foo {
		volatile Foo next;

		Foo getNextNext() {
			// Commented out code as reminder of silly bug
			// if (next != null) {
			//     // This can still result in NPE, next can change between reads
			//     return next.next;
			// }
			// This is how we do it!
			Foo currNextVal = next;
			if (currNextVal != null) {
				return currNextVal.next;
			}
			return null;
		}
	}


Q: "But... I not be doing no concurrency or nuffin' guv"
A: Using Unsafe to gain a view of on-heap addresses is concurrent access by definition.
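To see the single-read pattern hold up under pressure, here is a minimal stress sketch. The class and method names (SingleReadDemo, Node, survivesConcurrentWrites) are made up for illustration; one thread keeps flipping next between a node and null while the caller hammers getNextNext(). Because the field is read exactly once per call, no NullPointerException should escape.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SingleReadDemo {
    static final class Node {
        volatile Node next;

        // Read the volatile field once, then work only with the local copy.
        Node getNextNext() {
            Node currNextVal = next;
            if (currNextVal != null) {
                return currNextVal.next;
            }
            return null;
        }
    }

    // Calls getNextNext() repeatedly while a writer thread flips 'next';
    // returns true if no NullPointerException was observed.
    static boolean survivesConcurrentWrites(int iterations) {
        Node head = new Node();
        Node tail = new Node();
        AtomicBoolean running = new AtomicBoolean(true);
        Thread writer = new Thread(() -> {
            while (running.get()) {
                head.next = tail;
                head.next = null;
            }
        });
        writer.start();
        boolean ok = true;
        try {
            for (int i = 0; i < iterations; i++) {
                head.getNextNext();
            }
        } catch (NullPointerException e) {
            ok = false;
        } finally {
            running.set(false);
            try {
                writer.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        System.out.println("single-read version survived: " + survivesConcurrentWrites(1_000_000));
    }
}
```

Swapping the body of getNextNext() for the commented-out double-read version makes the same harness fail sooner or later, though how soon depends on timing.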

Unsafe address: What is it good for?

Absolutely nothing! sayitagain-huh! I exaggerate, if it was good for nothing it would not be there, let's look at the friggin manual:

	/**
	 * Allocates a new block of native memory, of the given size in bytes. The
	 * contents of the memory are uninitialized; they will generally be
	 * garbage. The resulting native pointer will never be zero, and will be
	 * aligned for all value types. Dispose of this memory by calling {@link
	 * #freeMemory}, or resize it with {@link #reallocateMemory}.
	 *
	 * @throws IllegalArgumentException if the size is negative or too large
	 * for the native size_t type
	 *
	 * @throws OutOfMemoryError if the allocation is refused by the system
	 */
	public native long allocateMemory(long bytes);

	/**
	 * Fetches a native pointer from a given memory address. If the address is
	 * zero, or does not point into a block obtained from {@link
	 * #allocateMemory}, the results are undefined.
	 * <p> If the native pointer is less than 64 bits wide, it is extended as
	 * an unsigned number to a Java long. The pointer may be indexed by any
	 * given byte offset, simply by adding that offset (as a simple integer) to
	 * the long representing the pointer. The number of bytes actually read
	 * from the target address may be determined by consulting {@link
	 * #addressSize}.
	 */
	public native long getAddress(long address);

	/**
	 * Stores a native pointer into a given memory address. If the address is
	 * zero, or does not point into a block obtained from {@link
	 * #allocateMemory}, the results are undefined.
	 * <p> The number of bytes actually written at the target address may be
	 * determined by consulting {@link #addressSize}.
	 */
	public native void putAddress(long address, long x);


As we can see the behaviour is only defined if we use the methods together, and by that I mean that get/putAddress are only useful when used with an address that is within a block of memory allocated by allocateMemory. Now undefined is an important word here. It means it might work some of the time... or it might not... or it might crash your VM. Let's think about this.

Q: What type of addresses are produced by allocateMemory?

A: Off-Heap memory addresses -> unmanaged memory, not touched by GC or any other JVM processes

The off-heap addresses are stable from the VM point of view. It has no intention of running around changing them. Once allocated they are all yours to manage, and whether or not you cut your fingers in the process is completely in your control. This is why the behaviour is defined. On-Heap addresses on the other hand are a different story.
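For contrast with what follows, here is a minimal sketch of the defined use case: allocateMemory, putAddress, getAddress and freeMemory used together on the same off-heap block. It assumes a HotSpot JDK on which sun.misc.Unsafe is still reachable through the theUnsafe field (newer JDKs deprecate these methods and may warn); the class name OffHeapAddressDemo and the helpers loadUnsafe and roundTrip are made up for this example.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffHeapAddressDemo {
    // Assumption: the singleton is reachable via reflection on this JVM.
    static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not reachable on this JVM", e);
        }
    }

    // Stores a pointer-sized value in freshly allocated native memory and reads it back.
    static long roundTrip(long value) {
        Unsafe unsafe = loadUnsafe();
        long block = unsafe.allocateMemory(unsafe.addressSize());
        try {
            unsafe.putAddress(block, value);
            return unsafe.getAddress(block);
        } finally {
            unsafe.freeMemory(block); // nobody else will release this for us
        }
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(roundTrip(0x1234L)));
    }
}
```

The address in block never changes between the put and the get because the GC does not know or care about it, which is exactly the property on-heap addresses lack.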

Playing With Fire: Converting An Object Ref to An Address

So imagine you just had to know the actual memory address of a given instance... perhaps you just can't resist a good dig under the hood, or maybe you are concerned about memory layout... Here's how you'd go about it:

	static final int REF_SIZE = ...;
	static final int OBJECT_ARRAY_BASE = UNSAFE.arrayBaseOffset(Object[].class);
	static final boolean USE_COMPRESSED_REFS = ...;
	static final int COMPRESSED_REF_SHIFT = ...;

	public static long addressOf(Object o) {
		return addressOf(o, REF_SIZE);
	}

	public static long addressOf(Object o, int oopSize) {
		// Park the reference in an array slot so its raw bits can be read back
		Object[] array = new Object[1];

		array[0] = o;

		long objectAddress;
		switch (oopSize) {
		case 4:
			// 32 bit ref: widen without sign extension
			objectAddress = UNSAFE.getInt(array, OBJECT_ARRAY_BASE) & 0xFFFFFFFFL;
			break;
		case 8:
			objectAddress = UNSAFE.getLong(array, OBJECT_ARRAY_BASE);
			break;
		default:
			throw new Error("unsupported address size: " + oopSize);
		}

		array[0] = null;

		return toNativeAddress(objectAddress);
	}

	public static long toNativeAddress(long address) {
		if (USE_COMPRESSED_REFS) {
			return address << COMPRESSED_REF_SHIFT;
		} else {
			return address;
		}
	}

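To make the arithmetic in toNativeAddress concrete, here is a small worked sketch. It assumes zero-based compressed references with a shift of 3 (8-byte object alignment); both are assumptions for illustration, and a JVM running with a non-zero heap base would also need that base added. CompressedRefDemo, ASSUMED_SHIFT and decode are hypothetical names.

```java
public class CompressedRefDemo {
    // Hypothetical setting: zero-based compressed refs, 8-byte object alignment.
    static final int ASSUMED_SHIFT = 3;

    // Widen the raw 32-bit slot value without sign extension, then scale it up.
    static long decode(int rawSlotValue) {
        long unsigned = rawSlotValue & 0xFFFFFFFFL;
        return unsigned << ASSUMED_SHIFT;
    }

    public static void main(String[] args) {
        int raw = 0xF0000010; // negative when viewed as a Java int
        System.out.printf("raw=0x%08X native=0x%X%n", raw, decode(raw));
    }
}
```

Skipping the mask would sign-extend the negative int and produce a nonsense address, which is why the 4-byte case above masks with 0xFFFFFFFFL before shifting.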

Now... you'll notice the object ref needs a bit of cuddling to turn into an address. Did I come up with such devilishly clever code myself? No... I will divulge a pro-tip here:

If you are going to scratch around the underbelly of the JVM, learn from as close to the JVM as you can -> from the JDK classes, or failing that, from an OpenJDK project like JOL (another Shipilev production)

In fact, the above code could be re-written to:

	// Let Shipilev do all the heavy lifting!
	import org.openjdk.jol.util.VMSupport;
	...

	// Suddenly, I feel the urge to get an object address
	long address = VMSupport.addressOf(anUnsuspectingObject);

	// Satisfied. I move on to never using that address...
	...


Now that we have the address what can we do with it? Could we use it to copy the object? Maybe we could read or modify the object state? NO! We can but admire its numerical beauty and muse on the temperamental values waiting at the other end of that address. The value at the other end of this address may have already been moved by GC...

Key Point: On-Heap Addresses Are NOT Stable

Consider the fact that at any time your code may be paused and the whole heap can be moved around... any address value you had which pointed to the heap is now pointing to a location holding data which may be trashed/outdated/wrong and using that data will lead to a funky result indeed. Also consider that this applies to class metadata or any other internal accounting managed by the JVM.
If you are keen to use Unsafe in the heap, use object references, not addresses. I would urge you not to mix the 2 together (i.e. have object references to off-heap memory) as that can easily lead to a very confused GC trying to chase references into the unknown and crashing your VM.
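A minimal sketch of the reference-plus-offset style recommended above, assuming sun.misc.Unsafe is reachable via reflection on the JVM in use. RefBasedAccessDemo, loadUnsafe and readElement are hypothetical names. The System.gc() call is only there to give the collector a chance to move the array; the read stays correct because the JVM resolves the reference at the time of access.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class RefBasedAccessDemo {
    // Assumption: the singleton is reachable via reflection on this JVM.
    static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not reachable on this JVM", e);
        }
    }

    // Reads array[index] through (reference, offset). No bounds check is done,
    // so the caller must pass a valid index.
    static int readElement(int[] array, int index) {
        Unsafe unsafe = loadUnsafe();
        long base = unsafe.arrayBaseOffset(int[].class);
        long scale = unsafe.arrayIndexScale(int[].class);
        System.gc(); // the array may be relocated here; our reference follows it
        return unsafe.getInt(array, base + index * scale);
    }

    public static void main(String[] args) {
        System.out.println(readElement(new int[] {7, 11, 13}, 1));
    }
}
```

Had we computed an absolute address before the GC and read through it afterwards, there would be no such guarantee.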

Case Study: SizeOf an Object (Don't do this)

This dazzling fit of hackery cropped up first (to my knowledge) here on the HighScalability blog:

	public static long sizeOf(Object object) {
		Unsafe unsafe = getUnsafe();
		// Original : return unsafe.getAddress( normalize( unsafe.getInt(object, 4L) ) + 12L );
		// This is my elaborate breakdown of original one liner
		int addressOfKlassInObjectHeader = unsafe.getInt(object, 4L);
		long nativeAddressOfKlass = normalize(addressOfKlassInObjectHeader);
		long addressOfLayoutHelper = nativeAddressOfKlass + 12L;
		return unsafe.getAddress(addressOfLayoutHelper);
	}

	public static long normalize(int value) {
		if (value >= 0) return value;
		return (~0L >>> 32) & value;
	}


This is some sweet machete swinging action :-). The dude who wrote this is not suggesting it is safe, and only claims it is correct on a 32bit VM. And indeed, it can work and passes cursory examination. The author also states correctly that this will not work for arrays and that with some corrections this can be made to work for 64 bit JVMs as well. I'm not going to try and fix it for 64 bit JVMs, though most of the work is already done in the JOL code above. The one flaw in this code that cannot be reliably fixed is that it relies on the native Klass address (nativeAddressOfKlass, line 6) to remain valid long enough for it to chase the pointer through to read the layout helper (the getAddress call, line 8). Spot the similarity to the volatile bug above?
This same post demonstrates how to forge references from on-heap objects to off-heap 'objects', which in effect lets you cast a native pointer to an object reference. It goes on to state that this is a BAD IDEA, and indeed it can easily crash your VM when GC comes a knocking (but it might not, I didn't try).
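As an aside on the normalize helper in the code above: it is an unsigned widening of an int to a long. The sketch below copies the helper and compares it with Integer.toUnsignedLong (available since Java 8); NormalizeDemo is a made-up class name.

```java
public class NormalizeDemo {
    // The helper from the quoted code, behaviour unchanged.
    static long normalize(int value) {
        if (value >= 0) return value;
        return (~0L >>> 32) & value;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, Integer.MAX_VALUE, -1, Integer.MIN_VALUE};
        for (int s : samples) {
            System.out.println(s + " -> " + normalize(s)
                    + " (toUnsignedLong: " + Integer.toUnsignedLong(s) + ")");
        }
    }
}
```

None of this makes the pointer chase any safer; it only shows that the helper treats the 32 bit header word as an unsigned address.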

Case Study: Shallow Off-Heap Object Copy (Don't do this)

Consider the following method of making an off-heap copy of an object (from here, Mishadof's blog):

	static Object shallowCopy(Object obj) {
		long size = sizeOf(obj);
		long start = toAddress(obj);
		long address = getUnsafe().allocateMemory(size);
		getUnsafe().copyMemory(start, address, size);
		return fromAddress(address);
	}

	static long toAddress(Object obj) {
		Object[] array = new Object[] {obj};
		long baseOffset = getUnsafe().arrayBaseOffset(Object[].class);
		return normalize(getUnsafe().getInt(array, baseOffset));
	}

	static Object fromAddress(long address) {
		Object[] array = new Object[] {null};
		long baseOffset = getUnsafe().arrayBaseOffset(Object[].class);
		getUnsafe().putLong(array, baseOffset, address);
		return array[0];
	}

	static long sizeOf(Object object) {
		return getUnsafe().getAddress(
			normalize(getUnsafe().getInt(object, 4L)) + 12L);
	}

	static long normalize(int value) {
		if (value >= 0) return value;
		return (~0L >>> 32) & value;
	}


We see the above is using the exact same method for computing size as demonstrated above. It's getting the on-heap object address (limited correctness, see addresses discussion above), then copying the object off-heap and reading it back as a new object copy... Calling Unsafe.copyMemory(srcAddress, destAddress, length) with an on-heap source address is inviting the same concurrency bug discussed above: the object can move between reading its address and copying from it. A similar method is demonstrated in the HighScalability post, but there the copy method used is Unsafe.copyMemory(srcRef, srcOffset, destRef, destOffset, length). This is important as the reference-based method is not exposed to the same concurrency issue.
Both are playing with fire of course by converting off-heap memory to objects. Imagine this scenario:

a copy of object A is made which refers to another object B, the copy is presented as object C
object A is de-referenced leading to A and B being collected in the next GC cycle
object C is still storing a stale reference to B which is no longer managed by the VM

What will happen if we read that stale reference? I've seen the VM crash in similar cases, but it might just give you back some garbage values, or let you silently corrupt some other instance state... oh, the fun you will have chasing that bugger down...
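For comparison, here is a minimal sketch of the reference-based copyMemory overload mentioned above, used the boring way: copying primitive data off-heap and reading it back by address, without ever turning the off-heap bytes into an 'object'. It assumes sun.misc.Unsafe is reachable via reflection and that the array is non-empty; RefCopyDemo, loadUnsafe and copyOffHeapAndRead are hypothetical names.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class RefCopyDemo {
    // Assumption: the singleton is reachable via reflection on this JVM.
    static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not reachable on this JVM", e);
        }
    }

    // Copies a non-empty long[] off-heap and reads element 'index' back from the copy.
    static long copyOffHeapAndRead(long[] source, int index) {
        Unsafe unsafe = loadUnsafe();
        long base = unsafe.arrayBaseOffset(long[].class);
        long bytes = (long) source.length * Long.BYTES;
        long address = unsafe.allocateMemory(bytes);
        try {
            // srcBase is a reference, so the JVM locates the source itself;
            // destBase == null means destOffset is an absolute off-heap address.
            unsafe.copyMemory(source, base, null, address, bytes);
            return unsafe.getLong(address + (long) index * Long.BYTES);
        } finally {
            unsafe.freeMemory(address);
        }
    }

    public static void main(String[] args) {
        System.out.println(copyOffHeapAndRead(new long[] {10L, 20L, 30L}, 2));
    }
}
```

Since only primitive values cross the boundary, there is no stale reference for the GC or for you to trip over later.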

Apologies

I don't mean to present either of the above post authors as fools, they are certainly clever and have presented interesting findings for their readers to contemplate without pretending their readers should run along and build on their samples. I have personally commented on some of the code on Mishadof's post and admit my comments were incomplete in identifying the issues discussed above. If anything I aim to highlight that this hidden concurrency aspect can catch out even the clever.

Finally, I would be a hypocrite if I told people not to use Unsafe; I end up using it myself for all sorts of things. But as Mr. Maker keeps telling us "Be careful, because scissors are sharp!"

{This post is part of a long-running series on lock-free queues, check out the full index to get more context here}
Having recently bitched about the lack of treatment of final fields as final I was urged by Mr. Shipilev to demonstrate the issue in a more structured way (as opposed to a drunken slurred rant). I have now recovered my senses to do just that. The benchmark being run and the queue being discussed are covered in this post, so please refresh your memory for context if you need. The point is clear enough without full understanding of the context though.
It is perhaps a fact well known to those who know it well that final fields, while providing memory visibility guarantees, are not actually immutable. One can always use reflection, or Unsafe, to store new values into those fields, and in fact many people do (and Cliff Click hates them and wishes them many nasty things). This is (I believe) the reason behind some seemingly trivial optimizations not being done by the JIT compiler.
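A minimal sketch of the reflection route, for the doubtful. FinalFieldDemo and Holder are made-up names; the trick relies on setAccessible(true) lifting the write restriction for non-static final fields of ordinary classes (it does not apply to records or hidden classes, and recent JDKs may print a warning).

```java
import java.lang.reflect.Field;

public class FinalFieldDemo {
    static final class Holder {
        private final int[] buffer;

        Holder(int[] buffer) {
            this.buffer = buffer;
        }

        int[] buffer() {
            return buffer;
        }
    }

    // Swaps the array behind the final field and reports whether the getter sees the new one.
    static boolean overwriteFinalField() {
        try {
            Holder holder = new Holder(new int[] {1});
            int[] replacement = {2};
            Field field = Holder.class.getDeclaredField("buffer");
            field.setAccessible(true); // allows writing to this final instance field
            field.set(holder, replacement);
            return holder.buffer() == replacement;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("final field replaced: " + overwriteFinalField());
    }
}
```

As long as code like this is legal, the JIT has to be careful about assuming a final instance field never changes.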

Code Under Test: FFBufferWithOfferBatch.poll()

	private long offset(long index) {
		return ARRAY_BASE + ((index & mask) << ELEMENT_SHIFT);
	}

	public E poll() {
		final long offset = offset(head);
		// LOAD/LOAD barrier
		final E e = (E) UnsafeAccess.UNSAFE.getObjectVolatile(buffer, offset);
		if (null == e)
			return null;

		// STORE/STORE barrier
		UnsafeAccess.UNSAFE.putOrderedObject(buffer, offset, null);
		head++;
		return e;
	}


The buffer field is a final field of FFBufferWithOfferBatch and is being accessed twice in the method above. A trivial optimization on the JIT compiler side would be to load it once into a register and reuse the value. It is 'immutable' after all. But if we look at the generated assembly (here's how to, I also took the opportunity to try out JITWatch which is brilliant):

	# {method} 'poll' '()Ljava/lang/Object;' in 'io/jaq/spsc/FFBuffer' BEFORE
	# [sp+0x20] (sp of caller)
	0x00007facb1060c80: mov r10d,DWORD PTR [rsi+0x8]
	0x00007facb1060c84: cmp rax,r10
	0x00007facb1060c87: jne 0x00007facb1037960 ; {runtime_call}
	0x00007facb1060c8d: xchg ax,ax
	[Verified Entry Point]
	0x00007facb1060c90: sub rsp,0x18
	0x00007facb1060c97: mov QWORD PTR [rsp+0x10],rbp ;*synchronization entry
	; - io.jaq.spsc.FFBuffer::poll@-1 (line 133)
	0x00007facb1060c9c: mov r10,QWORD PTR [rsi+0x90]
	0x00007facb1060ca3: and r10,QWORD PTR [rsi+0x198]
	;*invokevirtual getObjectVolatile
	; - io.jaq.spsc.FFBuffer::poll@17 (line 135)
	0x00007facb1060caa: mov r11d,DWORD PTR [rsi+0x9c]
	;*getfield buffer
	; - io.jaq.spsc.FFBuffer::poll@13 (line 135)
	0x00007facb1060cb1: mov r8d,DWORD PTR [r11+r10*4+0x90]
	;*invokevirtual getObjectVolatile
	; - io.jaq.spsc.FFBuffer::poll@17 (line 135)
	0x00007facb1060cb9: test r8d,r8d
	0x00007facb1060cbc: je 0x00007facb1060ce3 ;*invokevirtual putOrderedObject
	; - io.jaq.spsc.FFBuffer::poll@37 (line 139)
	0x00007facb1060cbe: mov r9d,DWORD PTR [rsi+0x9c] ;*getfield buffer
	; - io.jaq.spsc.FFBuffer::poll@32 (line 139)
	0x00007facb1060cc5: mov DWORD PTR [r9+r10*4+0x90],r12d
	;*invokevirtual putOrderedObject
	; - io.jaq.spsc.FFBuffer::poll@37 (line 139)
	0x00007facb1060ccd: inc QWORD PTR [rsi+0x198] ;*putfield head
	; - io.jaq.spsc.FFBuffer::poll@47 (line 140)
	0x00007facb1060cd4: mov rax,r8 ;*invokevirtual getObjectVolatile
	; - io.jaq.spsc.FFBuffer::poll@17 (line 135)
	0x00007facb1060cd7: add rsp,0x10
	0x00007facb1060cdb: pop rbp
	0x00007facb1060cdc: test DWORD PTR [rip+0xc6df31e],eax # 0x00007facbd740000
	; {poll_return}
	0x00007facb1060ce2: ret
	0x00007facb1060ce3: xor eax,eax
	0x00007facb1060ce5: jmp 0x00007facb1060cd7


We can see buffer is getting loaded twice (line 15, mov r11d,DWORD PTR [rsi+0x9c], and again at line 24, mov r9d,DWORD PTR [rsi+0x9c]). Why doesn't JIT do the optimization? I'm not sure... it may be due to the volatile load forcing a load order that could in theory require the 'new' value in buffer to be made visible... I don't know.

Hack around it, see if it makes a difference

Is that a big deal? Let's find out. The fix is trivial:

	public E poll() {
		final long offset = offset(head);
		final E[] lb = buffer;
		@SuppressWarnings("unchecked")
		final E e = (E) UnsafeAccess.UNSAFE.getObjectVolatile(lb, offset);
		if (null == e) {
			return null;
		}
		UnsafeAccess.UNSAFE.putOrderedObject(lb, offset, null);
		head++;
		return e;
	}


And the assembly code generated demonstrates the right behaviour now (one load at line 15, mov r11d,DWORD PTR [rsi+0x9c], with r11 reused for the store):

	# {method} 'poll' '()Ljava/lang/Object;' in 'io/jaq/spsc/FFBuffer' AFTER
	# [sp+0x20] (sp of caller)
	0x00007f885ca2ccc0: mov r10d,DWORD PTR [rsi+0x8]
	0x00007f885ca2ccc4: cmp rax,r10
	0x00007f885ca2ccc7: jne 0x00007f885ca03960 ; {runtime_call}
	0x00007f885ca2cccd: xchg ax,ax
	[Verified Entry Point]
	0x00007f885ca2ccd0: sub rsp,0x18
	0x00007f885ca2ccd7: mov QWORD PTR [rsp+0x10],rbp ;*synchronization entry
	; - io.jaq.spsc.FFBuffer::poll@-1 (line 133)
	0x00007f885ca2ccdc: mov r10,QWORD PTR [rsi+0x90]
	0x00007f885ca2cce3: and r10,QWORD PTR [rsi+0x198]
	;*invokevirtual getObjectVolatile
	; - io.jaq.spsc.FFBuffer::poll@19 (line 136)
	0x00007f885ca2ccea: mov r11d,DWORD PTR [rsi+0x9c]
	;*getfield buffer
	; - io.jaq.spsc.FFBuffer::poll@10 (line 134)
	0x00007f885ca2ccf1: mov r9d,DWORD PTR [r11+r10*4+0x90]
	;*invokevirtual getObjectVolatile
	; - io.jaq.spsc.FFBuffer::poll@19 (line 136)
	0x00007f885ca2ccf9: test r9d,r9d
	0x00007f885ca2ccfc: je 0x00007f885ca2cd1c
	0x00007f885ca2ccfe: mov DWORD PTR [r11+r10*4+0x90],r12d
	;*invokevirtual putOrderedObject
	; - io.jaq.spsc.FFBuffer::poll@38 (line 140)
	0x00007f885ca2cd06: inc QWORD PTR [rsi+0x198] ;*putfield head
	; - io.jaq.spsc.FFBuffer::poll@48 (line 141)
	0x00007f885ca2cd0d: mov rax,r9 ;*invokevirtual getObjectVolatile
	; - io.jaq.spsc.FFBuffer::poll@19 (line 136)
	0x00007f885ca2cd10: add rsp,0x10
	0x00007f885ca2cd14: pop rbp
	0x00007f885ca2cd15: test DWORD PTR [rip+0x9f2c2e5],eax # 0x00007f8866959000
	; {poll_return}
	0x00007f885ca2cd1b: ret
	0x00007f885ca2cd1c: xor eax,eax
	0x00007f885ca2cd1e: jmp 0x00007f885ca2cd10


Now, was that so hard to do? And more importantly, does it make any difference to performance? As discussed previously, the throughput benchmark is sensitive to changes in the cost balance between offer/poll. The optimization creates an interesting change in the pattern of the results:

The benchmark is run on Ubuntu13.10/JDK7u45/i7@2.4. The X axis is the index of the benchmark run and the Y axis is the result in ops/sec. The chart displays the results for before the change (B-*) and after (A-*) with different sparse data settings. We can see the change has accelerated the consumer, leading to increased benefit from sparse data that was not visible before. With sparse data set to 1 the optimization results in a 2% increase in performance. Not mind blowing, but still. The same change applied to the producer thread loop (localizing the reference to the queue field) discussed in the previous post enabled a 10% difference in performance as the field reference stopped the loop from unrolling and was read on each iteration. I used the poll() example here because it involves a lot less assembly code to wade through.

Hopefully this illustrates the issue to Mr. Shipilev's satisfaction. Thanks go to Gil Tene for pointing out the optimization to me and to Chris Newland for JITWatch.

Tuesday, 25 February 2014

Unsafe Pointer Chasing: Running With Scissors

Thursday, 13 February 2014

When I say final, I mean FINAL!