Tuesday, 12 February 2013

Alignment, Concurrency and Torture (x86)

Summary: Exploring the effects of unaligned/aligned concurrent access in Java for off/on heap memory, how to test for correctness, torture and cushions. 

Concurrent access to memory is a pain at the best of times. The JMM offers Java programmers a way to reason about concurrency in a platform-independent manner, and while not without fault it's certainly better than having no memory model at all. Other languages are catching up of course, but the focus of this article is not comparative concurrency, so look it up yourselves ;).
When we go off heap via Unsafe or direct ByteBuffers, or when we use Unsafe to gain direct access to on-heap memory, we are on our own. Direct memory access strips away the JVM guarantees and leaves you exposed to the underlying OS/architecture memory access rules and to the JVM runtime's interpretation of them.
Should you be worried?
Silly question.
If you are not the worried type, follow this link with my blessings.
Worried? Here are two guarantees, now gone, to worry about:
  • Atomicity - memory access on the JVM is guaranteed to be atomic (the notable exception being non-volatile long and double, which may be written in two halves), even when the underlying system does not support it. No similar guarantee is available for ByteBuffers; in fact ByteBuffers are explicitly not thread safe, so atomicity is not guaranteed at all. Unsafe guarantees nothing, so caution is advised.
  • Word tearing - this means concurrent updates to adjacent bytes in memory do not result in corruption of state. Some processors perform memory writes in units of a word, which can be 2 or 4 bytes. This can result in the unmolested parts of the word being written back over bytes concurrently written by other threads. While the JLS is binding for 'plain' memory access (via assignment), the guarantee does not necessarily extend to off heap memory. A rough sketch of a word tearing probe follows this list.
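To make the word tearing concern concrete, here is a rough sketch of what an off-heap probe for it might look like (purely illustrative and of my own making, not a JCTS test; as argued below, a clean run proves nothing):

import sun.misc.Unsafe;
import java.lang.reflect.Field;
import java.util.concurrent.CyclicBarrier;

// Illustrative word tearing probe (my own sketch, not a JCTS test).
// Each thread writes its own byte of the same word exactly once per round;
// if byte stores were implemented as word-sized read-modify-write, one
// thread's byte could be overwritten with a stale value.
public class WordTearingProbe {
    static final Unsafe U = getUnsafe();
    static final int ROUNDS = 1_000_000;

    public static void main(String[] args) throws Exception {
        final long address = U.allocateMemory(2);
        final CyclicBarrier start = new CyclicBarrier(3);
        final CyclicBarrier done = new CyclicBarrier(3);
        new Thread(new ByteWriter(address, (byte) 1, start, done)).start();
        new Thread(new ByteWriter(address + 1, (byte) 2, start, done)).start();

        for (int i = 0; i < ROUNDS; i++) {
            U.putByte(address, (byte) 0);
            U.putByte(address + 1, (byte) 0);
            start.await();                  // release both writers
            done.await();                   // wait for both writes
            byte b0 = U.getByte(address);
            byte b1 = U.getByte(address + 1);
            if (b0 != 1 || b1 != 2) {
                System.out.println("Torn word: b0=" + b0 + " b1=" + b1);
            }
        }
        U.freeMemory(address);
        System.exit(0);                     // writers are still parked on the barrier
    }

    static class ByteWriter implements Runnable {
        final long addr;
        final byte value;
        final CyclicBarrier start, done;

        ByteWriter(long addr, byte value, CyclicBarrier start, CyclicBarrier done) {
            this.addr = addr;
            this.value = value;
            this.start = start;
            this.done = done;
        }

        public void run() {
            try {
                while (true) {
                    start.await();
                    U.putByte(addr, value); // one byte write per round
                    done.await();
                }
            } catch (Exception e) {
                // barrier broken on exit, nothing to do
            }
        }
    }

    static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}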
The problem with verifying these behaviours is that concurrency guarantees and issues are, by definition, timing dependent. This means that while most of the time there is no issue, that is not to say that throwing in a few extra processors, changing the memory layout, or maybe just waiting a bit longer will not trigger it...
So how do you prove your code is thread safe once you've stepped off heap and into the wild? I recently answered a question on Stack Overflow about testing a data structure for correctness under multi-threaded access. My answer was accepted, so obviously it's correct. Cutting out the niceties to keep this short:
 1.  Formally prove the correctness of your program in terms of the JMM. 

 2.  Construct test cases which demonstrate the intended behaviour by using count down latches or other means of suspending threads of execution at specific points in your program to force contention (see the sketch after this list).

 3.  Statistically demonstrate correctness by exercising the code over a sufficient period of time from multiple threads.
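As an aside, the second option usually boils down to parking all threads at a gate and releasing them together so they hit the suspect code at the same instant. A minimal sketch, with a plain long slot standing in for whatever structure is under test:

import java.util.concurrent.CountDownLatch;

// Sketch of option 2: use latches to line threads up at a specific point
// and release them together to force contention on the code under test.
public class ForcedContentionExample {
    public static void main(String[] args) throws InterruptedException {
        final int threads = Runtime.getRuntime().availableProcessors();
        final CountDownLatch ready = new CountDownLatch(threads);
        final CountDownLatch go = new CountDownLatch(1);
        final CountDownLatch finished = new CountDownLatch(threads);
        final long[] sharedSlot = new long[1];  // stand-in for the structure under test

        for (int i = 0; i < threads; i++) {
            final long value = (i % 2 == 0) ? 0L : -1L;
            new Thread(new Runnable() {
                public void run() {
                    ready.countDown();          // signal: at the gate
                    try {
                        go.await();
                    } catch (InterruptedException e) {
                        return;
                    }
                    for (int j = 0; j < 1_000_000; j++) {
                        sharedSlot[0] = value;  // the contended operation
                    }
                    finished.countDown();
                }
            }).start();
        }

        ready.await();   // wait until every thread is poised at the gate
        go.countDown();  // release them all at once
        finished.await();
        System.out.println("final value: " + sharedSlot[0]);
    }
}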
This is all fine when you write 'safe' Java, but option 1, and therefore option 2, go out the window when you stray off spec. This is an important observation: if the rules are not well defined, then testing the edge cases of those rules is not that meaningful.
Which leaves us with number 3.

Science is torture

It is a fact well known to those who know it well that a scientific experiment can only prove your theory is either definitely wrong or so far right. This (limited) power of induction from past experience is what empiricism is all about. To prove JVM implementations implement the JMM correctly, Mr. Shipilev (Benchmarking Tzar) has put together the Java Concurrency Torture Suite, into which he drags JVMs and tries to break them on any number of architectures (not to worry, he tries it on cute bunnies first). It's a great project for anyone trying to reason about the JMM, as it gives you examples in code of edge cases and the corresponding range of expected behaviours on all architectures. It also means that if you have questionable behaviours you want to explore on a particular architecture, Aleksey has just saved you a lot of work.
The JCTS project is a very professional piece of work, so I'll try not to waste words on what you can read in the readme. The general approach is that if you leave it (the concurrent behaviour) running long enough, the edge cases will come crawling out of the woodwork. Here's one bug found using it; the discussion in the ticket is quite interesting.
The test runs also report how many times each state was detected, which has the added benefit of educating us on the distribution of occurrences of the different cases. The next version may support ASCII bar charts too. Instead of yapping on about it, let's see how atomicity is tackled and all will become clear.

 

Biggles! Fetch...THE CUSHIONS!

The JCTS already has atomicity tests; I just had to add my own to target unaligned access. I added two tests, one for unaligned access within the cache line and one for crossing it. For brevity I'll stick to the long test.
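The actual tests are on GitHub (see the updates below); as a rough standalone sketch of the same experiment, assuming plain Unsafe access and a hand-rolled writer/reader pair instead of the torture harness, the idea looks something like this:

import sun.misc.Unsafe;
import java.lang.reflect.Field;

// Standalone reconstruction of the unaligned long atomicity experiment
// (the real tests run under the torture harness; this is just the idea).
// A writer flips a long at an unaligned address between 0 and -1 while a
// reader records any observed value that is neither, i.e. a torn read.
public class UnalignedLongAtomicity {
    static final Unsafe U = getUnsafe();

    public static void main(String[] args) {
        long base = U.allocateMemory(256);
        long lineStart = (base + 63) & ~63L;  // first 64-byte boundary in the block
        final long address = lineStart + 1;   // unaligned, but within the line
        // use lineStart + 60 instead to make the 8-byte value straddle two lines

        Thread writer = new Thread(new Runnable() {
            public void run() {
                for (long i = 0; ; i++) {
                    U.putLong(address, (i & 1) == 0 ? 0L : -1L);
                }
            }
        });
        writer.setDaemon(true);  // let the JVM exit when the reader is done
        writer.start();

        long torn = 0;
        for (long i = 0; i < 100_000_000L; i++) {
            long observed = U.getLong(address);
            if (observed != 0L && observed != -1L) {
                torn++;
                if (torn <= 10) {
                    System.out.println("torn read: 0x" + Long.toHexString(observed));
                }
            }
        }
        System.out.println("torn reads observed: " + torn);
        // memory deliberately not freed: the daemon writer may still touch it
    }

    static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}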
The method for allocating aligned direct byte buffers is described in this post. The cross-cache-line case is very similar, with the position set intentionally to cause cache-straddling access.
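For completeness, here is a sketch of the buffer set-up, assuming the usual over-allocate-and-slice trick and access to sun.nio.ch.DirectBuffer for the raw address (the linked post has the real thing):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import sun.nio.ch.DirectBuffer;

// Sketch: allocate a direct ByteBuffer, align its start to a 64-byte cache
// line, then pick an index so an 8-byte write straddles two lines.
public class AlignedBufferExample {
    static final int CACHE_LINE = 64;

    static ByteBuffer allocateAligned(int capacity) {
        // over-allocate so we can slide the start up to the next line boundary
        ByteBuffer raw = ByteBuffer.allocateDirect(capacity + CACHE_LINE);
        long address = ((DirectBuffer) raw).address();
        int slack = (int) (address & (CACHE_LINE - 1));
        int offset = slack == 0 ? 0 : CACHE_LINE - slack;
        raw.position(offset);
        raw.limit(offset + capacity);
        return raw.slice().order(ByteOrder.nativeOrder());
    }

    public static void main(String[] args) {
        ByteBuffer aligned = allocateAligned(256);  // starts on a cache line boundary

        // in-line unaligned write: bytes 1..8 stay within line 0
        aligned.putLong(1, -1L);

        // cache-straddling write: bytes 60..67 span lines 0 and 1
        aligned.putLong(CACHE_LINE - 4, -1L);

        System.out.println("aligned buffer base address: 0x"
                + Long.toHexString(((DirectBuffer) aligned).address()));
    }
}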
I expected the straddled line access to be non-atomic, and for once I was right (read on, it doesn't last)!!!
Note the finely printed output, a joy! The test ran several times and we can see that in the majority of cases things are fine: the observed result is either 0 or -1. But there are other values that should not be there; those observations are the result of the lack of atomicity. Reading the value in some undefined, trashed state gives us... mmm... trash. As I suspected! Patting myself on the shoulder, I moved on to the unaligned, non-straddling tests.
The results suggested that unaligned access inside the cache line is atomic on my machine. I admit this is not what I expected, and it is a bit surprising. Good old Intel, they can really make this work!


Our chief weapon is surprise

Going through these experiments and results reminded me of a comment made by Gil Tene (Azul CTO) on Martin Thompson's blog:
Cache alignment is only a performance issue. Never a correctness issue. Cache lines do not add or remove atomicity. Specifically, non atomic updates within a single cache line can still be seen non-atomically by other threads. The Java spec only guarantees atomic treatment (as in the bits in the field are read and written all-or-nothing) for boolean, byte, char, short, int, and float field types. The larger types (long and double) *may* be read and written using two separate operations (although most JVMs will still perform these as atomic, all-or-nothing operations).
BTW, all x86 variants DO support both unaligned data types in memory, as well as LOCK operations on such types. This means that on an x86, a LOCK operation that spans the boundary between two cache lines will still be atomic (this little bit of historical compatibility probably has modern x86 implementors cursing each time).
I read the above to suggest Gil was thinking atomicity is a given across the cache line, so I sent him an e-mail to discuss my findings. Now Gil has been under the JVM hood more than most people, and this turned out to be quite enlightening:
I'm not saying that cross cache line boundaries cannot induce non-atomicity to occur more often. I *am* saying that whenever you see non-atomicity on accesses that cross cache line boundaries, that same non-atomicity potential was there [for the same code] even within a single cache line, and is usually there regardless of whether or not the in-same-cache-line access is aligned to the type size boundary. 
He went on to point out that proving correctness by experimentation is simply not possible (I told you it's well known), and that where the spec doesn't promise you anything, you should expect nothing:
... there is no way to verify atomicity by experimentation. You can only disprove it, but cannot prove it.
The only way you can control atomicity on a known x86 arch. for off heap access would be to control ALL x86 instructions used to access your data, where knowing exactly which instructions are used and what the layout restrictions you impose would allow you to prove atomicity through known architectural qualities.

 

I didn't expect a kind of Spanish Inquisition

Mmmm... but how would I know? Given there are no guarantees on the one hand and the limitations of experimentation on the other, we are back to square one... And while musing on Gil's wise words I ran the unaligned access test again, and what do you know:

Bugger. Bugger. Bugger. It seems a torn value sneaks in every once in a while. This happens quite infrequently, and as things worked out it failed to happen until my little exchange with Gil was over, leading me to believe Gil possesses that power of the experienced to have reality demonstrate for them on cue. On the other hand, now I know it doesn't work. Certainty at last.
What about aligned memory access? I gave it a good run and saw no ill effects, but given Gil's warnings this may not be enough. The JCTS already had a test for aligned access, and it sets no particular expectations. This is important: it is OK for a JVM implementation to give you no atomicity on putLong/Int/... (via Unsafe or DirectByteBuffer). Aligned access on x86 is atomic, but as Gil points out we don't have full control at every level, so that may not be a strong enough guarantee...

Conclusion? 

The takeaway from the above is that off-heap storage should be treated as if atomicity is not guaranteed, unless read/write alignment is properly handled. Is that the end of the world? Certainly not, but it does mean precautions must be taken if data is to be written and read concurrently. Note that mutating shared data is always a risk, but you may get further without noticing issues with a POJO approach. Also note that some of the tools used for safe object publication are not available off heap, final fields in particular.


Many thanks to Aleksey Shipilev and Gil Tene for their feedback and guidance along the way. I leave word tearing as an exercise to the interested reader.

Update 14/02/2013:
Martin Thompson got back to me asking if the same atomicity issue is still there for putOrdered*/put*Volatile when the cache line is crossed. I expected it would be (so did he), wrote some more tests, and it turned out we were both right. It makes no difference whether you use put*/putOrdered*/put*Volatile/compareAndSwap* [correction: CAS is atomic, see update 23/09/2013 below]: if you cross the cache line there is no atomicity. Similarly I would expect unaligned access within the cache line to lack atomicity, but I have not had time to add the tests. The code is on GitHub for your pleasure.
Thanks Martin.

Update 24/07/2013:
The Java Concurrency Torture suite has gone into the OpenJDK toolbox and is now public again as JCStress (I suggested JangBang, but Shipilev wouldn't listen :-(). I 'forked' it and added my tests to the new project; it's all available here. The tests are under the org.openjdk.jcstress.tests.unsafe.unaligned package.

Update 23/09/2013:
I've written a further post on JCStress here, and a further clarification on the atomicity of unaligned writes using different methods here. The bottom line is that the only way to perform guaranteed atomic unaligned writes is by using CAS. Volatile reads are not atomic, however, so this method is of limited use.
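For illustration, a sketch of what a CAS-based unaligned write might look like off heap, assuming Unsafe.compareAndSwapLong with a null base object and an absolute address (the helper below is hypothetical, not code from the linked posts):

import sun.misc.Unsafe;
import java.lang.reflect.Field;

// Sketch: an unaligned off-heap long write made atomic via CAS
// (compareAndSwapLong with a null base object and an absolute address).
// Per the updates above, CAS was the only write flavour that stayed
// atomic across a cache line in these experiments.
public class CasUnalignedWrite {
    static final Unsafe U = getUnsafe();

    // Atomically set the long at 'address' to 'newValue', returning the old value.
    static long casSet(long address, long newValue) {
        for (;;) {
            long current = U.getLong(address);  // plain read, possibly torn: only a hint
            if (U.compareAndSwapLong(null, address, current, newValue)) {
                return current;                 // CAS only succeeds on an exact match
            }
        }
    }

    public static void main(String[] args) {
        long base = U.allocateMemory(192);
        long lineStart = (base + 63) & ~63L;
        long straddling = lineStart + 60;       // an 8-byte slot spanning two cache lines

        casSet(straddling, -1L);
        System.out.println("value: " + U.getLong(straddling));
        U.freeMemory(base);
    }

    static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}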