This is an attempt to clarify and improve on previous statements I've made here (and before that here) on unaligned memory access and its implications. Let's start from the top:
- Unaligned access is reading/writing a short/int/long from/to an address that is not divisible by its size.
- Unaligned access to data on the heap is not possible without using sun.misc.Unsafe.
- Unaligned access to data off the heap is possible via direct ByteBuffer.
- Unaligned access to data is very possible using Unsafe, both on and off heap (see the sketch after this list).
- Unaligned access is therefore bad news even for good boys and girls using direct ByteBuffer, as the results of unaligned access are architecture/OS specific. You may find that your code crashes inexplicably on some processors when running some OSs.
- Unaligned access atomicity is, pure and simple, not a JVM issue. You can only "legally" do it via a direct ByteBuffer, and that is explicitly not a thread-safe object, so you are in no position to expect atomicity. (As an aside, no writes to heap ByteBuffers are atomic.)
- What happens on unaligned access stays on unaligned access ;-)
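To make the list above concrete, here is a minimal illustration (my own, not code from the tests discussed below) of the two routes to an unaligned long access:

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import sun.misc.Unsafe;

public class UnalignedRoutes {
    public static void main(String[] args) throws Exception {
        // Route 1: direct ByteBuffer. The long starts at offset 1, so the
        // access is unaligned. Legal API-wise, but atomicity is not promised.
        ByteBuffer direct = ByteBuffer.allocateDirect(16);
        direct.putLong(1, 0x0102030405060708L);
        System.out.println(Long.toHexString(direct.getLong(1)));

        // Route 2: sun.misc.Unsafe, off heap. Same unaligned address, but no
        // bounds checks, and the consequences are architecture/OS specific.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);
        long base = unsafe.allocateMemory(16);
        unsafe.putLong(base + 1, 0x0102030405060708L);
        System.out.println(Long.toHexString(unsafe.getLong(base + 1)));
        unsafe.freeMemory(base);
    }
}
```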
Unaligned access within the cache line
Unaligned access within the cache line is not atomic on older Intel processors, but is atomic on later models (from the Intel developer manual, under 8.1.1 Guaranteed Atomic Operations):
"The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:
- Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line"
To demonstrate this time around I'll use a simpler means than JCStress and just roll my own; the same tests are available on my fork of JCStress here. The same test will be used for the cross-line test, and it covers all 3 flavours of visible writes (ordered/volatile/CAS).
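The full tests are in the fork linked above; the following is a condensed sketch of the idea (my own reconstruction, not the exact fork code, showing only the volatile write flavour):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// A writer flips a long at a configurable unaligned offset between all-zeroes
// and all-ones; the reader prints "WTF" and exits if it ever observes a mix.
public class UnalignedAtomicityTest {
    static final Unsafe UNSAFE = getUnsafe();
    static final long BASE = UNSAFE.allocateMemory(128); // room for 2 cache lines
    // Offset 12 keeps the 8 bytes inside one 64-byte line (assuming BASE is
    // line aligned; the real test takes care of aligning the base address).
    static final long ADDRESS = BASE + Integer.getInteger("offset", 12);

    public static void main(String[] args) {
        Thread writer = new Thread(() -> {
            long v = 0;
            while (true) {
                v ^= -1L; // alternate between 0x00...00 and 0xFF...FF
                UNSAFE.putLongVolatile(null, ADDRESS, v); // volatile flavour
            }
        });
        writer.setDaemon(true);
        writer.start();
        while (true) {
            long observed = UNSAFE.getLongVolatile(null, ADDRESS);
            if (observed != 0L && observed != -1L) { // a mix of old and new
                System.out.println("WTF: " + Long.toHexString(observed));
                return;
            }
        }
    }

    static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```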
We expect the test to run forever (or for as long as we need to feel comfortable) if the writes are atomic. If they are not, we expect the test to print "WTF" and exit early. Running the test for all write variations with an offset of 12 leads to no breakage of atomicity on a Core2Duo and a Xeon. This test should break (i.e. exit early with a WTF) on older processors, and this result has been confirmed to indeed happen.
So far so good: unaligned access is confirmed atomic on recent Intel processors, and the evidence supports the documentation.
Unaligned access across the cache line
Crossing the cache line falls outside the scope of the statement quoted above, so such an access is not atomic unless it is locked (see under 8.1.2.2 Software Controlled Bus Locking):
"The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommend that locked accesses be aligned on their natural boundaries for better system performance... Locked operations are atomic with respect to all other memory operations and all externally visible events."Cool, so... which one is locked? One way to find out is run the test, so I did.... and... it looks like none...
So nothing is locked? That seemed wrong.
Certainly CAS is locked. CAS translates directly into LOCK CMPXCHG, so it is definitely locked, and yet the experiment definitely fails. This is as far as I got with JCStress, and this result raised some eyebrows. Re-reading the code I wrote for JCStress didn't raise any suspicion, so I wrote a variation of the above test, and still it looked like the CAS values were broken.
I have to admit I was happy to leave it at that, and in some ways it is not a wrong conclusion, but I met with some insistent doubt from Mr. Gil Tene of Azul. With both Gil and the Intel manual insisting this was not right, I had a final go at cracking the contradiction and read through the assembly. The CAS writes are, as expected, LOCK CMPXCHG, but the volatile read is:
0x0000000106b5d60c: movabs r10,0x7f92a38248fc
0x0000000106b5d616: mov r10,QWORD PTR [r10] ;*invokevirtual getLongVolatile
A MOV with no lock! So it is perhaps not the CAS that is broken, but the volatile read. To prove the point I changed the broken-value check to observe via CAS instead of a volatile read.
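In sketch form (again my reconstruction, reusing UNSAFE and ADDRESS from the sketch above), the reader no longer trusts a volatile read; instead it exploits the fact that a CAS of v over v can only succeed, atomically, if the current value really is v:

```java
// The writer still alternates 0L and -1L, now at an offset that crosses the
// line (e.g. 60, assuming a 64-byte, line-aligned layout), so a torn value
// must be one of the two half-and-half patterns below (little-endian split
// at the line boundary).
static final long[] TORN = { 0x00000000FFFFFFFFL, 0xFFFFFFFF00000000L };

static void watchForBrokenValues() {
    while (true) {
        for (long torn : TORN) {
            // compareAndSwapLong is a LOCKed access: it can only succeed if
            // the value genuinely is the torn pattern at that instant, so a
            // success here is hard proof of a broken write.
            if (UNSAFE.compareAndSwapLong(null, ADDRESS, torn, torn)) {
                System.out.println("WTF: " + Long.toHexString(torn));
                return;
            }
        }
    }
}
```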
And indeed, with this version CAS is proved to never deliver broken values: CAS is atomic across the cache line. But... as far as Java is concerned, this is only of use if you want to use CAS to test for particular written values, which is not that useful. This highlights the fact that while CAS is implemented using LOCK CMPXCHG, it does not expose the same interface: CMPXCHG returns the current value even on failure, which would have been exactly what we'd want to replace the volatile read in this particular and admittedly perverse case.
Armed with this new atomic means of observation we can re-examine the atomicity of the ordered and volatile writers. And find, to our shared joy/sorrow, that they are just as broken as before. The ordered write should come as no surprise, as it is a plain MOV, but the volatile write is a bit surprising (to me, if not to others; Gil, for instance, wasn't surprised, the clever bastard). It's a close call: you could implement a volatile write using a single locked instruction (LOCK XCHG), but the actual implementation is a MOV followed by a LOCK ADD (see the end of Martin's post and the comments there for a discussion of that implementation choice). As things currently stand, cache-line-crossing volatile writes are not atomic.
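For completeness, here is how the three write flavours map to Unsafe calls (the Unsafe method names are real; the enum and wiring are my own sketch):

```java
import sun.misc.Unsafe;

// The three write flavours and the x86 they compile down to on this hardware.
enum WriteType { ORDERED, VOLATILE, CAS }

final class Writers {
    static void write(Unsafe u, long address, long prev, long next, WriteType type) {
        switch (type) {
            case ORDERED:  // plain MOV: free to tear across the line
                u.putOrderedLong(null, address, next);
                break;
            case VOLATILE: // MOV followed by LOCK ADD: the MOV itself still tears
                u.putLongVolatile(null, address, next);
                break;
            case CAS:      // LOCK CMPXCHG: atomic even across the line
                u.compareAndSwapLong(null, address, prev, next);
                break;
        }
    }
}
```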
Summary & Credits
Rules of unalignment thus far are:
- If you can at all help it, read and write aligned data.
- Unaligned writes and reads within the cache line are atomic on recent Intel processors, but not atomic on older models.
- Unaligned writes and reads across the cache line are not atomic unless they are locked.
- Of the available options, only CAS is locked. There is no locked way to read an arbitrary value.
- Volatile/ordered writes are not atomic across the cache line; the volatile read is not atomic either.
The above is true for Intel x86, but I would not expect other processors to behave the same.
This post is really a summary of the combined efforts and ideas of Gil, Martin (thanks guys) and myself. Any errors are in all probability mine, please let me know if you find any. The test code is here; you can run it with any offset, write type or read type by playing with the system properties, so have a play and let me know your results. I'm particularly curious to hear from anyone running on non-Intel processors who can share their results.