Thursday, 13 February 2014

When I say final, I mean FINAL!

{This post is part of a long-running series on lock-free queues, check out the full index to get more context here}
Having recently bitched about final fields not being treated as final, I was urged by Mr. Shipilev to demonstrate the issue in a more structured way (as opposed to a drunken slurred rant). Having now recovered my senses, I'll do just that. The benchmark being run and the queue being discussed are covered in this post, so please refresh your memory for context if you need to. The point is clear enough without full understanding of the context, though.
It is perhaps a fact well known to those who know it well that final fields, while providing memory visibility guarantees, are not actually immutable. One can always use reflection, or Unsafe, to store new values into those fields, and in fact many people do (and Cliff Click hates them and wishes them many nasty things). This is (I believe) the reason behind some seemingly trivial optimizations not being done by the JIT compiler.
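To make that concrete, here's a minimal sketch (the NotSoFinal/Holder classes are made up for the example) of storing a new value into a final instance field via reflection, which works on the JDK7 generation of runtimes discussed here:

import java.lang.reflect.Field;

public class NotSoFinal {
    static class Holder {
        // initialized from a parameter so it is not a compile-time constant,
        // which javac would otherwise inline at the use site
        final long value;
        Holder(long v) { value = v; }
    }

    public static void main(String[] args) throws Exception {
        Holder h = new Holder(42);
        Field f = Holder.class.getDeclaredField("value");
        f.setAccessible(true); // defeat the final/access check
        f.setLong(h, 13);      // store a 'new' value into the final field
        System.out.println(h.value); // prints 13
    }
}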

Code Under Test: FFBufferWithOfferBatch.poll()

private long offset(long index) {
    // translate a logical index into an offset within the backing array
    return ARRAY_BASE + ((index & mask) << ELEMENT_SHIFT);
}

@SuppressWarnings("unchecked")
public E poll() {
    final long offset = offset(head);
    // LOAD/LOAD barrier
    final E e = (E) UnsafeAccess.UNSAFE.getObjectVolatile(buffer, offset);
    if (null == e) {
        return null;
    }
    // STORE/STORE barrier
    UnsafeAccess.UNSAFE.putOrderedObject(buffer, offset, null);
    head++;
    return e;
}
The buffer field is a final field of FFBufferWithOfferBatch and is accessed twice in the method above. A trivial optimization on the JIT compiler's part would be to load it once into a register and reuse the value; it is 'immutable', after all. But that's not what happens if we look at the generated assembly (here's how to do that yourself; I also took the opportunity to try out JITWatch, which is brilliant).
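For those who want to reproduce the disassembly, a typical invocation looks something like the line below (a sketch: it assumes the hsdis disassembler library is installed on the JVM's library path, and the CompileCommand filter shown is illustrative rather than the exact one I used):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:CompileCommand=print,*FFBufferWithOfferBatch.poll <benchmark main class>

Here's what came out for poll() before the change: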
# {method} 'poll' '()Ljava/lang/Object;' in 'io/jaq/spsc/FFBuffer' BEFORE
# [sp+0x20] (sp of caller)
0x00007facb1060c80: mov r10d,DWORD PTR [rsi+0x8]
0x00007facb1060c84: cmp rax,r10
0x00007facb1060c87: jne 0x00007facb1037960 ; {runtime_call}
0x00007facb1060c8d: xchg ax,ax
[Verified Entry Point]
0x00007facb1060c90: sub rsp,0x18
0x00007facb1060c97: mov QWORD PTR [rsp+0x10],rbp ;*synchronization entry
; - io.jaq.spsc.FFBuffer::poll@-1 (line 133)
0x00007facb1060c9c: mov r10,QWORD PTR [rsi+0x90]
0x00007facb1060ca3: and r10,QWORD PTR [rsi+0x198]
;*invokevirtual getObjectVolatile
; - io.jaq.spsc.FFBuffer::poll@17 (line 135)
0x00007facb1060caa: mov r11d,DWORD PTR [rsi+0x9c]
;*getfield buffer
; - io.jaq.spsc.FFBuffer::poll@13 (line 135)
0x00007facb1060cb1: mov r8d,DWORD PTR [r11+r10*4+0x90]
;*invokevirtual getObjectVolatile
; - io.jaq.spsc.FFBuffer::poll@17 (line 135)
0x00007facb1060cb9: test r8d,r8d
0x00007facb1060cbc: je 0x00007facb1060ce3 ;*invokevirtual putOrderedObject
; - io.jaq.spsc.FFBuffer::poll@37 (line 139)
0x00007facb1060cbe: mov r9d,DWORD PTR [rsi+0x9c] ;*getfield buffer
; - io.jaq.spsc.FFBuffer::poll@32 (line 139)
0x00007facb1060cc5: mov DWORD PTR [r9+r10*4+0x90],r12d
;*invokevirtual putOrderedObject
; - io.jaq.spsc.FFBuffer::poll@37 (line 139)
0x00007facb1060ccd: inc QWORD PTR [rsi+0x198] ;*putfield head
; - io.jaq.spsc.FFBuffer::poll@47 (line 140)
0x00007facb1060cd4: mov rax,r8 ;*invokevirtual getObjectVolatile
; - io.jaq.spsc.FFBuffer::poll@17 (line 135)
0x00007facb1060cd7: add rsp,0x10
0x00007facb1060cdb: pop rbp
0x00007facb1060cdc: test DWORD PTR [rip+0xc6df31e],eax # 0x00007facbd740000
; {poll_return}
0x00007facb1060ce2: ret
0x00007facb1060ce3: xor eax,eax
0x00007facb1060ce5: jmp 0x00007facb1060cd7

We can see buffer is getting loaded twice (note the two ;*getfield buffer annotations above, one for each access). Why doesn't the JIT do this optimization? I'm not sure... it may be that the volatile load forces a load order which could, in theory, require a 'new' value in buffer to be made visible... I don't know.

Hack around it, see if it makes a difference

Is that a big deal? Let's find out. The fix is trivial:
public E poll() {
    final long offset = offset(head);
    // load the final field into a local once, use the local from here on
    final E[] lb = buffer;
    @SuppressWarnings("unchecked")
    final E e = (E) UnsafeAccess.UNSAFE.getObjectVolatile(lb, offset);
    if (null == e) {
        return null;
    }
    UnsafeAccess.UNSAFE.putOrderedObject(lb, offset, null);
    head++;
    return e;
}

And the generated assembly now demonstrates the right behaviour, with buffer loaded only once (a single ;*getfield buffer annotation):
# {method} 'poll' '()Ljava/lang/Object;' in 'io/jaq/spsc/FFBuffer' AFTER
# [sp+0x20] (sp of caller)
0x00007f885ca2ccc0: mov r10d,DWORD PTR [rsi+0x8]
0x00007f885ca2ccc4: cmp rax,r10
0x00007f885ca2ccc7: jne 0x00007f885ca03960 ; {runtime_call}
0x00007f885ca2cccd: xchg ax,ax
[Verified Entry Point]
0x00007f885ca2ccd0: sub rsp,0x18
0x00007f885ca2ccd7: mov QWORD PTR [rsp+0x10],rbp ;*synchronization entry
; - io.jaq.spsc.FFBuffer::poll@-1 (line 133)
0x00007f885ca2ccdc: mov r10,QWORD PTR [rsi+0x90]
0x00007f885ca2cce3: and r10,QWORD PTR [rsi+0x198]
;*invokevirtual getObjectVolatile
; - io.jaq.spsc.FFBuffer::poll@19 (line 136)
0x00007f885ca2ccea: mov r11d,DWORD PTR [rsi+0x9c]
;*getfield buffer
; - io.jaq.spsc.FFBuffer::poll@10 (line 134)
0x00007f885ca2ccf1: mov r9d,DWORD PTR [r11+r10*4+0x90]
;*invokevirtual getObjectVolatile
; - io.jaq.spsc.FFBuffer::poll@19 (line 136)
0x00007f885ca2ccf9: test r9d,r9d
0x00007f885ca2ccfc: je 0x00007f885ca2cd1c
0x00007f885ca2ccfe: mov DWORD PTR [r11+r10*4+0x90],r12d
;*invokevirtual putOrderedObject
; - io.jaq.spsc.FFBuffer::poll@38 (line 140)
0x00007f885ca2cd06: inc QWORD PTR [rsi+0x198] ;*putfield head
; - io.jaq.spsc.FFBuffer::poll@48 (line 141)
0x00007f885ca2cd0d: mov rax,r9 ;*invokevirtual getObjectVolatile
; - io.jaq.spsc.FFBuffer::poll@19 (line 136)
0x00007f885ca2cd10: add rsp,0x10
0x00007f885ca2cd14: pop rbp
0x00007f885ca2cd15: test DWORD PTR [rip+0x9f2c2e5],eax # 0x00007f8866959000
; {poll_return}
0x00007f885ca2cd1b: ret
0x00007f885ca2cd1c: xor eax,eax
0x00007f885ca2cd1e: jmp 0x00007f885ca2cd10

Now, was that so hard to do? And more importantly, does it make any difference to performance? As discussed previously, the throughput benchmark is sensitive to changes in the cost balance between offer/poll. The optimization creates an interesting change in the pattern of the results:
The benchmark is run on Ubuntu 13.10/JDK7u45/i7@2.4; the X axis is the index of the benchmark run and the Y axis is the result in ops/sec. The chart displays the results from before the change (B-*) and after (A-*) with different sparse data settings. We can see the change has accelerated the consumer, leading to an increased benefit from sparse data that was not visible before. With sparse data set to 1, the optimization results in a 2% increase in performance. Not mind-blowing, but still. The same change applied to the producer thread loop (localizing the reference to the queue field), discussed in the previous post, yielded a 10% difference in performance: the field was re-read on each iteration, which stopped the loop from unrolling. I used the poll() example here because it involves a lot less assembly code to wade through. A sketch of that producer-side change follows below.
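For illustration, here's a minimal sketch of that producer-side localization (the queue type, field and item here are made up for the example; this is not the actual benchmark code):

import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;

public class Producer implements Runnable {
    final Queue<Integer> queue = new ArrayBlockingQueue<Integer>(1024);
    volatile boolean running = true;
    static final Integer ITEM = 777;

    @Override
    public void run() {
        // Localize the final field once, outside the loop. Re-reading the
        // field on every iteration is what stopped the loop from unrolling.
        final Queue<Integer> q = queue;
        while (running) {
            q.offer(ITEM); // busy-spin producer; offer fails when the queue is full
        }
    }
}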

Hopefully this illustrates the issue to Mr. Shipilev's satisfaction. Thanks go to Gil Tene for pointing out the optimization to me and to Chris Newland for JITWatch.

4 comments:

  1. Thanks Nitsan. I've been aware of this at the superstitious level of "if I read a field twice in hot code, best to take a local reference", but never taken the time to quantify the cost. You've tied it up nicely, with a bow.

    Now get drunk and rant yourself some more ideas.

    ReplyDelete
    Replies
    1. I have run out of whiskey... contributions are welcome :P

      Delete
  2. How does one understand the barrier mechanism? I know this is a deep question, but developers cannot easily understand this.

    ReplyDelete
  3. Should re-run and update this with the latest Zing Falcon JIT results, which now does Truly Final field optimization.

    ReplyDelete
