Psychosomatic, Lobotomy, Saw: Unsafe

Showing posts with label Unsafe. Show all posts

Monday, 3 November 2014

The Mythical Modulo Mask

It is an optimisation well known to those who know it well that % of power of 2 numbers can be replaced by a much cheaper AND operator to the same power of 2 - 1. E.g:
x % 8 == x & (8 - 1)
[4/11/2014 NOTE] This works because the binary representation for N which is a power of 2 will have a single bit set to 1 and (N-1) will have all the bits below that set to 1 (e.g 8 = 00001000, 8-1= 00000111). When we do x AND (N-1) only the remainder of x / N will match the N-1 mask.
[4/11/2014 NOTE + slight spoiler: this only works when x >= 0]
[15/02/2016 NOTE: It has been pointed out that % is not modulo, but remainder. Why everyone calls it modulo I'm not sure]
The reason the & is so much cheaper is because while % is implemented using the DIV instruction, & is just AND and as it turns out DIV is expensive and AND is cheap on x86 CPUs (and other places too I think). The optimisation is used in the Disruptor as well as the JCTools circular array queues and in the ArrayDequeue and other JDK classes. Is it time to replace % with & everywhere in your code which has this opportunity?
[4/11/2014 NOTE] Technical term for this sort of optimization is Strength Reduction.

Starting Simple

Lets start with some basic benchmarks:
And the results (on JDK8u5/E5-2697 v2 @ 2.70GHz/-XX:-UseCompressedOops for consistency between assembly and results):
Benchmark Score error Units
moduloLengthNoMask 3.448 ± 0.007 ns/op
moduloConstantLengthNoMask 1.150 ± 0.002 ns/op
moduloLengthMask 1.030 ± 0.006 ns/op
moduloMask 0.862 ± 0.001 ns/op
consume 0.719 ± 0.001 ns/op
noop 0.287 ± 0.000 ns/op

So pretty much as per expectation the modulo operation is far more expensive than the mask:

The clever JIT is aware of the optimisation opportunity and will replace a constant % with the &. It is not a perfect replacement, but pretty close.
At this sort of low digit ns benchmark we can’t make a statement such as “modulo is 4 times more expensive” because the same machine produces a baseline of 0.287ns/op for the noop benchmark and 0.719ns/op for the consume benchmark. If we deduct the consume result from the other scores we see a 1 : 25 ratio between the costs. Is that a good way to model performance? not really either, performance is not additive so simply subtracting one cost from the other doesn't really work at this scale. The truth is somewhere fuzzy in between and if we really care we should look at the assembly.
It seems that using a pre-calculated mask field is more awesome than using the "array length - 1" as a mask. That is consistent with the expectation that the re-calculation of the mask on the fly, as well as loading the value to be used for that calculation, is more expensive than using the pre-calculated field.

I love it when a plan comes together...

Going Deeper

The reason we wanted the modulo in the first place was to read from the array, right? so let’s try that:

And the results:

Benchmark Score error Units
readByLengthNoMask 3.736 ± 0.005 ns/op
readByConstantLengthNoMask 1.437 ± 0.001 ns/op
readByMask 1.347 ± 0.022 ns/op
readByLengthMask 1.181 ± 0.049 ns/op
readNoMask 1.175 ± 0.004 ns/op
Well, what’s this I see? "length-1" mask is leading the chart! How’d that happen?

To quote from the famous “Jack and the FlumFlum Tree”:

“Don’t get your knickers in a twist!” said Jack,
“Let’s have a look in the patchwork sack.”

Lets start with the generated assembly for the constant modulo:

I didna see that one coming! the modulo on a constant is not your garden variety & mask affair since it turns out our original assertion about the mask/modulo equality is only true for positive numbers. The JIT in it’s wisdom is dealing with the negative case by doing (x = -x; x = x&15; x = -x;).
I think the above case could be made a tiny bit faster by switching the branch around (so jump for negative value). It’s easy however to see what happens if we simplify the constant version further by using a constant mask:
And results:
Benchmark Score error Units
moduloConstantLengthNoMask 1.150 ± 0.002 ns/op
moduloConstantLengthMask 0.860 ± 0.001 ns/op
readByConstantLengthNoMask 1.437 ± 0.001 ns/op
readByConstantLengthMask 1.209 ± 0.017 ns/op
So minor joy on the modulo, and reading is better than plain mask, nearly as good as the "length-1" mask. Oh well, let's move on.
The big surprise was the mask calculated on the fly from the array length version. How can calculating the mask on the fly, which seemed to be slower, end up being faster when reading from the array? Who feels like more assembly?
I was hoping the JVM was clever enough to remove the array bound checks, but that didn’t happen. What’s happening here is that the length load serves the purpose of both creating the mask and checking the bounds. This is not the case for the mask version where we load the mask for the index calculation and the length for the bounds check, thus paying for 2 loads instead of one:
So removing the computation did not make a difference because the bound check requires the extra load of the length anyhow, can we make the bounds check go away? Of course we can, but it’s Unsafe!!! Let’s do it anyways!

The assembly:

Shazzam! no bounds check, but look at all the work that’s gone into the unsafe read of the array. It would have been so much better if the unsafe read enjoyed the same addressing mode as normal array reads like so “r8d,DWORD PTR [r9+r10*4+0x18]”, but it seems the JIT compiler is not recognising the opportunity here. What’s the performance like?
Benchmark Score error Units
readByMask 1.347 ± 0.022 ns/op
readByLengthMask 1.181 ± 0.049 ns/op
readNoMask 1.175 ± 0.004 ns/op
unsafeReadByMask 1.152 ± 0.001 ns/op

This is even better than no mask at all. Yay?

Well… sort of. If you mean to have the fastest ‘get’ from the array that allows for an array size which is not an application constant, than this is a mini-win. In particular is saves you a load of the array length in this case and loads can cost anything really. In the case where index and mask are long we can get better code generated:

But performance is much the same for this case. Seems like there’s not much left to win in this case.

For completeness sake we can compare the no mask result with an Unsafe equivalent:

Benchmark Score error Units
unsafeReadByNoMask 1.038 ± 0.022 ns/op

readNoMask 1.175 ± 0.004 ns/op

So it seems slipping past the array boundary check is worth something, but is it generally worth it? what if we weren't dealing with just the one element?

Bound Check Elimination

Looking at the above optimisation we need to accept that it is probably only worth it if array bound checks happen on every access. If we now compare a sum over an array:

We get the following results (length=100):

Benchmark Score error Units
loopOverArrayStraight 26.855 ± 0.060 ns/op
loopOverArrayUnsafeInt 41.413 ± 0.056 ns/op
loopOverArrayUnsafeLong 76.257 ± 0.171 ns/op
Oh Unsafe, why you so sucky sucky? How come the unsafe versions suck so significantly? isn’t Unsafe the cure to all performance problems?
Once the bounds check is eliminated by the JIT we can see that for the UnsafeInt we have the same issue with addressing conversion, only now the cost is not compensated for by the bound check removal. The UnsafeLong version is even worse, how come?
The generated loop for the int case is long and boring because it’s unrolled, the long case is pretty small:
2 'bad' things just happened:

Addressing didn’t workout the way we’d like. Instead of the desired “mov r11d,DWORD PTR [r9+rdi*4+0x18]” we get a two stage setup where we do:”lea r10,[r9+rdi*4]” and then “add r11d,DWORD PTR [r10+0x18]”. Bummer.
We got a safe point poll in the loop. This is happening because long indexed loops are considered potentially very long (as opposed to shorter int loops... heuristics for time to safe point) and so include a safe point poll.

So we want to fix the addressing mode and stick to having an int index. If we were to insist on using Unsafe (perhaps because we are trying to do this with off heap memory instead of an array) we’d have to do this:
[4/11/2014 NOTE] Note that what we really want here is more than just getting rid of the multiplication/widening, we want the JIT to identify the expression calculated for offset as relative array access and pick the correct addressing mode for MOV to use. There are clever people out there trying to make sure this will work better in the future.
This removes the need for a safe point poll and simplifies addressing to the point where we nearly match the iteration over the array case (length=100):

Benchmark Score error Units
loopOverArrayStraight 26.855 ± 0.060 ns/op

  loopOverArrayUnsafePointer   27.377 ± 0.049  ns/op

We can explore the relationship between the implementations by testing for different array sizes:

            10   100     1000    10000

straight    4.3  26.8    289.2   2883.7

unsafeP     4.8  27.3    296.1   2886.4

So it seems that the smaller the array the more relative advantage the array iteration has when iterating in this fashion. This should not really be surprising, there's nothing here to confuse the JIT compiler and iterating over arrays is important enough to optimize. We have to work hard to get close to the JIT compiler when it does what it does best.

Summary

We had a simple optimisation in mind, replace a % with &:

Observed that for the case where constants are used the JIT is able to perform that optimisation for us almost as well as we’d do ourselves (we have no way of specifying positive only modulo, i.e uint).
We proved the viability of the optimisation in 2 variations, using a pre-calculated mask field and using (array.length - 1)
Using the optimisation in the context of a circular array read showed an interesting reversal in performance. We observed the cause of this reversal to be the array.length load for the purpose of bound checks reused for the calculated mask as opposed to the re-calculated.
Using Unsafe we managed to bypass the array bound check and get the best result using the mask for a single read.
When we try the same method naively in a loop (over the whole array) array bound check is eliminated and plain old array access is the best performer.
To regain the performance for Unsafe access we have to tweak the code to avoid safe point polling as well as to get the addressing mode we want in the resulting assembly. Even then plain array access is better for smaller arrays.

Simple innit?
Some notes on methodology:

I ran the same experiments on different intel generations, you get different results but the assembly remains the same. E.g. on older CPUs the maximum instructions per cycle would be less than on the Ivy Bridge CPU I've used here, this will lead to instruction spilling over to the next cycle. The L1 latency could be higher leading to loads dominating the costs etc. This ends up giving a slightly different balance to compute vs. memory load. Overall analysis holds.
Using -XX:-UseCompressedOops was done for the sake of consistent assembly and results. Using compressed oops makes loads look a bit clumsier and I wanted to have less to explain. But running with the flag on (as it is by default) also effects results on this scale. In particular because the compressed oops require a shift to be used and shifters are a limited resource the CPU (1 on Westmere, 2 on Ivy Bridge) it can end up adding a cycle to the results.
Running these same experiments on a laptop was good for getting the assembly out and a vague sense of scale for results, but measurements had far greater error in that environment. Also note that laptops and desktops tend to be a generation ahead of servers where processors are concerned.
An interesting experiment would be to look at same experiment with the JMH perfasm profiler. I did that but could not figure out how to get Intel syntax out of it and so for consistency sake stuck with what I had. Left as an exercise to the reader :P

Many thanks to J.P. Bempel and Peter Hughes for reviewing, any issues remaining were added by me after they reviewed the post.

Tuesday, 25 February 2014

Unsafe Pointer Chasing: Running With Scissors

Love running? Love scissors? I know just the thing for you! Following on from recent discussion on the Mechanical Sympathy mailing list I see an anti pattern worth correcting in the way people use Unsafe. I say correcting as I doubt people are going to stop, so they might as well be made aware of the pitfalls. This pattern boils down to a classic concurrency bug:

Q: "But... I not be doing no concurrency or nuffin' guv"
A: Using Unsafe to gain a view of on-heap addresses is concurrent access by definition.

Unsafe address: What is it good for?

Absolutely nothing! sayitagain-huh! I exaggerate, if it was good for nothing it would not be there, let's look at the friggin manual:

As we can see the behaviour is only defined if we use the methods together, and by that I mean that get/putAddress are only useful when used with an address that is within a block of memory allocated by allocateMemory. Now undefined is an important word here. It means it might work some of the time... or it might not... or it might crash your VM. Let's think about this.

Q: What type of addresses are produced by allocateMemory?

A: Off-Heap memory addresses -> unmanaged memory, not touched by GC or any other JVM processes

The off-heap addresses are stable from the VM point of view. It has no intention of running around changing them, once allocated they are all yours to manage and if you cut your fingers in the process or not is completely in your control, this is why the behaviour is defined. On-Heap addresses on the other hand are a different story.

Playing With Fire: Converting An Object Ref to An Address

So imagine you just had to know the actual memory address of a given instance... perhaps you just can't resist a good dig under the hood, or maybe you are concerned about memory layout... Here's how you'd go about it:

Now... you'll notice the object ref needs a bit of cuddling to turn into an address. Did I come up with such devilishly clever code myself? No... I will divulge a pro-tip here:

If you are going to scratch around the underbelly of the JVM, learn from as close to the JVM as you can -> from the JDK classes, or failing that, from an OpenJDK project like JOL (another Shipilev production)

In fact, the above code could be re-written to:
Now that we have the address what can we do with it? Could we use it to copy the object? maybe we could read or modify the object state? NO! we can but admire it's numerical beauty and muse on the temperamental values waiting at the other end of that address. The value at the other end of this address may have already been moved by GC...

Key Point: On-Heap Addresses Are NOT Stable

Consider the fact that at any time your code may be paused and the whole heap can be moved around... any address value you had which pointed to the heap is now pointing to a location holding data which may be trashed/outdated/wrong and using that data will lead to a funky result indeed. Also consider that this applies to class metadata or any other internal accounting managed by the JVM.
If you are keen to use Unsafe in the heap, use object references, not addresses. I would urge you not to mix the 2 together (i.e. have object references to off-heap memory) as that can easily lead to a very confused GC trying to chase references into the unknown and crashing your VM.

Case Study: SizeOf an Object (Don't do this)

This dazzling fit of hackery cropped up first (to my knowledge) here on the HighScalability blog:
This is some sweet macheta swinging action :-). The dude who wrote this is not suggesting it is safe, and only claims it is correct on a 32bit VM. And indeed, it can work and passes cursory examination. The author also states correctly that this will not work for arrays and that with some corrections this can be made to work for 64 bit JVMs as well. I'm not going to try and fix it for 64 bit JVMs, though most of the work is already done in the JOL code above. The one flaw in this code that cannot be reliably fixed is that it relies on the native Klass address (line 6) to remain valid long enough for it to chase the pointer through to read the layout helper (line 8). Spot the similarity to the volatile bug above?
This same post demonstrates how to forge references from on-heap objects to off-heap 'objects' which in effect let you cast a pointer to a native reference to an object. It goes on to state that is a BAD IDEA, and indeed it can easily crash your VM when GC comes a knocking (but it might not, I didn't try).

Case Study: Shallow Off-Heap Object Copy (Don't do this)

Consider the following method of making an off-heap copy of an object (from here, Mishadof's blog):
We see the above is using the exact same method for computing size as demonstrated above. It's getting the on-heap object address (limited correctness, see addresses discussion above) than copying the object off-heap and reading it back as a new object copy... Calling the Unsafe.copyMemory(srcAddress, destAddress, length) is inviting the same concurrency bug discussed above. A similar method is demonstrated in the HighScalability post, but there the copy method used is Unsafe.copyMemory(srcRef, srcOffset, destRef, destOffset, length). This is important as the reference using method is not exposed to the same concurrency issue.
Both are playing with fire ofcourse by converting off-heap memory to objects. Imagine this scenario:

a copy of object A is made which refers to another object B, the copy is presented as object C
object A is de-referenced leading to A and B being collected in the next GC cycle
object C is still storing a stale reference to B which is no managed by the VM

What will happen if we read that stale reference? I've seen the VM crash in similar cases, but it might just give you back some garbage values, or let you silently corrupt some other instance state... oh, the fun you will have chasing that bugger down...

Apologies

I don't mean to present either of the above post authors as fools, they are certainly clever and have presented interesting findings for their readers to contemplate without pretending their readers should run along and build on their samples. I have personally commented on some of the code on Mishadof's post and admit my comments were incomplete in identifying the issues discussed above. If anything I aim to highlight that this hidden concurrency aspect can catch out even the clever.

Finally, I would be a hypocrite if I told people not to use Unsafe, I end up using it myself for all sorts of things. But as Mr. Maker keeps telling us "Be careful, because scissors are sharp!"

Monday, 29 July 2013

Atomicity of Unaligned Memory Access in Java

This is an attempt to clarify and improve on previous statements I've made here (and before that here) on unaligned memory access and it's implications. Let's start from the top:

Unaligned access is reading/writing a short/int/long from/to an address that is not divisible by it's size.
Unaligned access to data on the heap is not possible without using sun.misc.Unsafe.
Unaligned access to data off the heap is possible via direct ByteBuffer.
Unaligned access to data is very possible using unsafe both on/off heap.
Unaligned access is therefore bad news even for good boys and girls using direct ByteBuffer, as the result on unaligned access are architecture/OS specific. You may find that your code crashes inexplicably on some processes when running some OSs.
Unaligned access atomicity pure and simple is not a JVM issue. You can only "legally" do it by using direct ByteBuffer, and that is explicitly not a thread safe object so you are in no position to expect atomicity (As an aside, in heap ByteBuffers all writes are not atomic)
What happens on unaligned access stays on unaligned access ;-)

So unaligned access is a problem for the few, and concurrent unaligned access is a problem for naughty, tricksy hobbitses who want to play with matches. Such devils may want to write an IPC, or use off heap memory to store structs, and are hoping to get coherent results. Here's my results thus far for x86 processors. Skip to the summary at the end if you don't care for the journey.

Unaligned access within the cache line

Unaligned access within the cache line is not atomic on older Intel processors, but is atomic on later models (from the intel developer manual under 8.1.1 Guaranteed Atomic Operations):

"The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:

Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line"

To demonstrate this time around I'll use a simpler means than JCStress and just roll my own, the same tests are available on my fork of JCStress here. The same test will be used for the cross line test, and it covers all 3 flavours of visible writes (ordered/volatile,CAS):
We expect the test to run forever (or as long as we need to feel comfortable) if the writes are atomic. If they are not we expect the test to write "WTF" and exit early. Running the test for all write variations with an offset of 12 leads to no breakage of atomicity when run on a Core2Duo and Xeon, this test should break (i.e exit early, WTF) on older processors and this result has been confirmed to indeed happen.
So far so good, unaligned access is confirmed atomic on recent Intel processors and the evidence supports the documentation.

Unaligned access across the cache line

Crossing the cache line falls outside the scope of the prev. statement quoted so is not atomic unless we use the LOCK instruction (see under 8.1.2.2 Software Controlled Bus Locking):

"The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommend that locked accesses be aligned on their natural boundaries for better system performance... Locked operations are atomic with respect to all other memory operations and all externally visible events."

Cool, so... which one is locked? One way to find out is run the test, so I did.... and... it looks like none...
So nothing is locked? that seemed wrong.
Certainly CAS is locked. CAS translates directly into LOCK CMPXCHG, so definitely locked, and yet the experiment definitely fails. This is as far as I got with JCStress, and this result raised some eyebrows. Re-reading through the code I wrote for JCStress didn't raise any suspicion, so I wrote a variation of the above test and still it looked like the CAS values are broken.
I have to admit I was happy to leave it at that, and in some ways it is not a wrong conclusion, but I met with some insistent doubt from Mr. Gil Tene of Azul. With both Gil and the Intel manual insisting this is not right I had a final go at cracking this contradiction and read through the assembly. The CAS writes are indeed as expected LOCK CMPXCHG, but the volatile read is:

0x0000000106b5d60c: movabs r10,0x7f92a38248fc
0x0000000106b5d616: mov r10,QWORD PTR [r10] ;*invokevirtual getLongVolatile

A MOV with no lock! So it is perhaps not the CAS that is broken, it is the volatile read. To prove the point I change the test for broken values to be:

And indeed, with this version CAS is proved to never deliver broken values, CAS is atomic across the cache line. But... as far as Java is concerned, this is only of use if you want to use CAS to test for the written values, which is not that useful. This highlights the fact that while CAS is implemented using LOCK CMPXCHG, it does not have the same interface. CMPXCHG returns the current value even on failure. That would have been exactly what we'd want to replace the volatile read in this particular and admittedly perverse case.

Now with the new atomic observation we can re-examine the atomicity of the ordered and volatile writers. And find, to our shared joy/sorrow that they are just as broken as before. The ordered write should come as no surprise as it is a plain MOV, but the volatile write is a bit surprising (to me if not to others. Gil for instance wasn't surprised, the clever bastard). It's a close call thing, you could implement a volatile write using a single locked instruction (LOCK XCHG), but the actual implementation is via a MOV followed by a LOCK ADD (see end of Martin's post and comments for discussion on that implementation choice). As things currently stand, cache line crossing volatile writes are not atomic.

Summary & Credits

Rules of unalignment thus far are:

If you can at all help it, read and write aligned data.
Unaligned writes and reads within the cache line are atomic on recent Intel processors, but not atomic on older models.
Unaligned writes and reads across the cache line are not atomic unless they are locked.
Of the available options, only CAS is locked. There is no locked way to read an arbitrary value.
Volatile/Ordered writes are not atomic, volatile read is also not atomic.

The above is true for Intel x86, but I would not expect other processors to behave the same.
This post is really a summary of the combined efforts and ideas by Gil, Martin (thanks guys) and myself. Any errors are in all probability mine, please let me know if you find any. The test code is here, you can run with any offset, write type or read type by playing with the system properties, have a play and let me know your results. I'm particularly curious to hear from anyone running on non-intel processors who can share their results.

Wednesday, 24 July 2013

Java Concurrency Torture Update: Don't Stress

A piece of late news: the concurrency suite is back. And so is my little test on top of it.
A few months ago I posted on yet another piece of joyful brilliance crafted by Master Shipilev, the java-concurrency-suite. I used the same framework to demonstrate some valid concerns regarding unaligned access to memory, remember children:

Unaligned access is not atomic
Unaligned access is potentially slow, in particular:

On older processors
If you cross the cache line

Unaligned access can lead to SEG_FAULT on non-intel architectures, and the result of that might be severe performance hit or your process crashing...

It was so much fun that someone had to put a stop to it, and indeed they did.
The java-concurrency-suite had to be removed from github, and the Master Inquisitor asked me to stop fucking around and remove my fork too, if it's not too much trouble... so I did.
Time flew by and at some point the powers that be decided the tool can return, but torture was too much of a strong word for those corporate types and so it has been rebranded JCStress. JC is nothing to do with the Jewish Community, as some might assume, it is the same Java Concurrency but now it's not torture (it's sanctioned after all), it's stress. Aleksey is simply stressing your JVM, he is not being excessively and creatively sadistic, that would be too much!
Following the article I had a short discussion with Mr T and Gil T with regards to unaligned access in the comments to Martin's blog post on off-heap tuple like storage. The full conversation is long, so I won't bore you, but the questions asked were:

Is unaligned access a performance only, or also a correctness issue?
Are 'volatile' write/reads safer than normal writes/reads?

The answers to the best of my understanding are:

Unaligned access is not atomic[Correction 23/09/2013: On later Intel processor unaligned access within the cache line is atomic, but access across the line is not. See update below and related later post], and therefore can lead to the sort of trouble you will not usually experience in your programs (i.e half written long/int/short values). This is a problem even if you use memory barriers correctly. It adds a 'happened-badly' eventuality to the usual happens before/after reasoning and special care must be taken to not get slapped in the face with a telephone pole.
Volatile read/writes are NOT special. In particular doing a CAS[Correction 23/09/2013: CAS is atomic, all others are not. See update below and related later post]/putOrdered/putVolatile write to an unaligned location such that the value is written across the cache line is NOT ATOMIC.

This came up in a discussion recently, which prompted me to go and fork JCStress on to bitbucket and rerun the experiments on both a Core2Duo and a Xeon processor and re-confirmed the result. This is a problem for folks considering the Atomics for ByteBuffer suggestion discussed on the concurrency-interest mailing list. There is nothing in the ByteBuffer interface to stop you from doing unaligned access...

UPDATE (1/08/2013): Shipilev's slides from JVMLS lightning talk.

UPDATE (23/09/2013): I've discussed the issue further with Gil and Martin, which led to the following post. Some corrections to the above observations have been made which I now inlined above.

Tuesday, 21 May 2013

Know Thy Java Object Memory Layout

Sometimes we want to know how an object is laid out in memory, here's 2 use cases and 2 solutions which remove the need to hypothesise.
A few months back I was satisfying my OCD by reading up on java object memory layout. Now Java, as we all know and love, is all about taking care of such pesky details as memory layout for you. You just leave it to the JVM son, and don't lose sleep over it.
Sometimes though... sometimes you do care. And when you do, here's how to find out.

In theory, theory and practice are the same

Here's an excellent article from a few years back which tells you all about how Java should layout your object, to summarise:

Objects are 8 bytes aligned in memory (address A is K aligned if A % K == 0)
All fields are type aligned (long/double is 8 aligned, integer/float 4, short/char 2)
Fields are packed in the order of their size, except for references which are last
Classes fields are never mixed, so if B extends A, an object of class B will be laid out in memory with A's fields first, then B's
Sub class fields start at a 4 byte alignment
If the first field of a class is long/double and the class starting point (after header, or after super) is not 8 aligned then a smaller field may be swapped to fill in the 4 bytes gap.

The reasons why the JVM doesn't just plonk your fields one after the other in the order you tell it are also discussed in the article, to summarise:

Unaligned access is bad, so JVM saves you from bad layout (unaligned access to memory can cause all sorts of ill side effects, including crashing your process on some architectures)
Naive layout of your fields would be wasting memory, the JVM reorders fields to improve the overall size of your object
JVM implementation requires types to have consistent layout, thus requiring the sub class rules

So... nice clear rules, what could possibly go wrong?

False False Sharing Protection

For one thing, the rules are not part of the JLS, they are just implementation details. If you read Martin Thompson's article about false sharing you'll notice Mr. T had a solution to false sharing which worked on JDK 6, but no longer worked on JDK 7. Here are the 2 versions:

It turns out the JVM changed the way it orders the fields between 6 and 7, and that was enough to break the spell. In fairness there is no rule specified above which requires the fields order to correlate to the order in which they were defined, but ... it's allot to worry about and it can trip you up.

Just as above rules were still fresh in my mind, LMAX (who kindly open sourced the Disruptor) released the Coalescing Ring Buffer. I read through the code and came across the following:

I approached Nick Zeeb on the blog post which introduced the CoalescingRingBuffer and raised my concern that the fields accessed by the producer/consumer might be suffering from false sharing, Nick's reply:

I’ve tried to order the fields such that the risk of false-sharing is minimized. I am aware that Java 7 can re-order fields however. I’ve run the performance test using Martin Thompson’s PaddedAtomicLong instead but got no performance increase on Java 7. Perhaps I’ve missed something so feel free to try it yourself.

Now Nick is a savvy dude, and I'm not quoting him here to criticise him. I'm quoting him to show that this is confusing stuff (so in a way, I quote him to comfort myself in the company of others equally confused professionals). How can we know? here's one way I thought of after talking to Nick:

Using Unsafe I can get the field offset from the object reference, if 2 fields are less than a cache line apart they can suffer from false sharing (depending on the end location in memory). Sure, it's a bit of a hackish way to verify things, but it can become part of your build so in the case of version changes you on't get caught out.

Note that it's the runtime which will determine the layout, not the compiler, so if you build on version X and run on Y it won't help you much.

Enough of that false sharing thing... so negative... why would you care about memory layout apart from false sharing? Here's another example.

The Hot Bunch

Through the blessings of the gods, at about the same time LMAX released the CoalescingRingBuffer, Gil Tene (CTO of Azul) released HdrHistogram. Now Gil is seriously, awesomely, bright and knows more about JVMs than most mortals (here's his InfoQ talk, watch it) so I was keen to look into his code. And what do you know, a bunch of hot fields:

What Gil is doing here is good stuff, he's trying to get relevant fields to huddle together in memory, which will improve the likelihood of them ending up on the same cache line, saving the CPU a potential cache miss. Sadly the JVM has other plans...

So here is another tool to help make sense of your memory layout to add to your tool belt: Java Object Layout I bumped into it by accident, not while obsessing about memory layout at all. Here's the output for Histogram:

Note how histogramData jumps to the botton and subBucketMask is moved to the top, breaking up our hot bunch. The solution is ugly but effective, move all fields but the hot bunch to an otherwise pointless parent class:

And the new layout:

Joy! I shall be sending Mr. Tene a pull request shortly :-)

UPDATE 16/01/2014: The excellent JOL has now been released under OpenJDK here. It's even better than before and supports many a funky feature (worthy of a separate post). I've updated the links to point to the new project. Also check out Shipilev's blog post on heap dumps demonstrating the use of this tool.

Sunday, 7 April 2013

135 Million messages a second between processes in pure Java

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
Porting an existing single producer/single consumer concurrent queue into an IPC mechanism via memory mapped files and getting 135 million messages throughput in pure Java.
In my previous post I covered a single producer/consumer queue developed and shared by Martin Thompson capable of delivering an amazing 130M messages per second. The queue he delivered is a great tool for communicating between threads, but sometimes communicating between threads is not enough. Sometime you need to leave your JVM and go out of process. Inter Process Communications (IPC) is a different problem to inter thread communications, can it be cracked by the same approach?

IPC, what's the problem?

Inter Process Communication is an old problem and there are many ways to solve it (which I will not discuss here). There are several attractions to specialized IPC solutions for Java:

Faster than socket communication.
An out of process integration option with applications written in other languages.
A means of splitting large VMs to smaller ones improving performance by allowing GC and JIT specialization.

For IPC to be attractive it has to be fast, otherwise you may as well go for network based solutions which would extend beyond your local machine uniformly. I attended an Informatica conference a while back and got talking to Todd Montgomerey about the Disruptor and mechanical sympathy, he suggested that IPC should be able to perform as well as inter thread messaging. I found the idea interesting and originally meant to port the Disruptor, but Martin's queue is simpler (and quicker) so I went for that instead. Starting with a good algorithm/data structure is very good indeed, now I just needed to bridge the gap and see if I can maintain the benefits.

Off the heap we go!

To do IPC we must go off heap. This has several implications for the queue, most importantly references are not supported. Also note persistence to and from the queue is required, though one could extend my implementation to support a zero copy interaction where a struct is acquired, written and committed instead of the offer method, and similarly acquired, read and finally released instead of the poll method. I plan to make several flavours of this queue to test out these ideas in the near future.

My IPC queue uses a memory mapped file as a means of acquiring a chunk of shared memory, there is no intention to use the persisted values though further development in that direction may prove interesting to some. So now that I got me some shared memory, I had to put the queue in it.

I started by laying out the queue counters and cached counters. After realizing the counters need to be aligned to work properly I learnt how to align memory in Java. I went on to verify that aligned memory offers the guarantees required for concurrent access. Quick summary:

aligned access means writing data types to addresses which divide by their size.
unaligned access is not atomic, which is bad for concurrency :(
unaligned access is slow, which is bad for performance :(
unaligned access may not work, depending on OS and architecture. Not working is very bad :(

Sorting out alignment is not such a big deal once you know how it works. One of the nice things about going off-heap was that solving false sharing has become far more straightforward. Move your pointer and you're in the next cache line, job done. This left me rather frustrated me with the tricks required to control memory layout in Java. Going back to the original implementation you will notice the Padded classes who's role it is to offer false sharing protection. They are glorious hacks (with all due respect) made necessary by this lack of control. The @Contended annotation coming in JDK 8 will hopefully remove the need for this.
This is how the memory layout worked out:

To illustrate in glorious ASCII graphics (each - is a byte), this is what the memory layout looks like when broken into cache lines:
|--------|--------|--------|head....|--------|--------|--------|--------|
|--------|--------|--------|tailCach|--------|--------|--------|--------|

|--------|--------|--------|tail----|--------|--------|--------|--------|

|--------|--------|--------|headCach|--------|--------|--------|--------|

I played around with mixing off heap counters with on heap buffer but in the interest of brevity I'll summarize and say the JVM does not like that very much and the end result performance is not as good as all heap/off-heap solutions. The code is available with everything else.

Once alignment and memory layout were sorted I had to give up the flexibility of having reference pointers and settle for writing my data (an integer) directly into the memory. This leaves my queue very restrictive in it's current form. I intend to revisit it and see what I can do to offer a more extendable API on top of it.

Let me summarize the recipe at this point:

Create a memory mapped file large enough to hold:

4 cache lines for counters/cached counters.
4 bytes(per integer) * queue capacity (must be a power of 2).
1 spare cache line to ensure you can align the above to the cache line.

Get a mapped byte buffer, which is a direct byte buffer on top of the mapped memory.
Steal the address and get the contained aligned byte buffer.
Setup pointers to the counters and the beginning of the buffer
Replace use of natural counters with off heap counters accessed via Unsafe using the pointers.
Replace use of array with use of offset pointers into buffer and Unsafe access.
Test and debug until you work out the kinks...

The above code should give you a fair idea how it works out and the rest is here. This queue can work in process and out of process as demonstrated in the tests included in the repository. Now that it works (for the limited use case, and with room for further improvement... but works), is it fast enough? not so fast? is it...<gasp> ... FASTER????!?!?!

Smithers, release the hounds

Here are the numbers for using the different implementations in process:

Implementation/Affinity	Same core	Cross core	Cross socket
P1C1QueueOriginal3	110M	130M	19M
P1C1OffHeapQueue	130M	220M	200M
P1C1QueueOriginalPrimitive	124M	220M	215M

Confused? Let me explain. First line is the measurements taken for the original queue. Similar to what was presented in prev. post, though I saw a slight improvement in the results with increasing the compile threshold to 100000.
The second line is my offheap implementation of same algorithm. It is significantly faster. This is not IPC yet, this is in process. The reason it is faster is because data is inlined in the queue, which means that by loading an entry in the queue we get the data as opposed to a reference to the data. Getting a reference is what you get when you have and Object[] array. The array holds the references and the data is elsewhere, this seems to make it more painful as we get further from the producer.
The last entry is a mutation of P1C1QueueOriginal3 into a primitive array backed queue to compare performance like for like. As you can see this displays very similar results to the off heap implementation supporting the theory that data in-lining is behind the observed performance boost.
The lesson here is an old one, namely that pointer chasing is expensive business further amplified by the distance between the producing CPU and consuming CPU.
The off-heap queue can offer an alternative to native code integration as the consuming thread may interact directly with the off-heap queue and write results back to a different off-heap queue.
Running a similar benchmark adapted to use a memory mapped file as the backing DirectByteBuffer for the off-heap queue we get:
    same core - ops/sec=135M
    across cores - ops/sec=98M
    across sockets - ops/sec=25M

JOY! a pure Java IPC that gives you 135M messages per second is more throughput then you'd get with most commercial products out there. This is still not as fast as the same queue in process and I admit I'm not sure what the source of the performance difference is. Still I am quite happy with it.
A few notes/observations from the experimentation process:

I got a variety of results, stabilizing around different average throughputs. I chose the best for the above summary and plan to go into detail about the results in the near future.
The JVM was launched with: -XX:+UseCondCardMark -XX:CompileThreshold=100000
Removing the Thread.yield from the producer/consumer loops improved performance when running on the same core, but made it worse otherwise.
Moving the queue allocation into the test loop changes the performance profile dramatically.
I've not had time to fully explore the size of the queue as a variable in the experiment but the little I've done suggests it makes a difference, choose the right size for your application.

I realize this post is rather less accessible than the previous one, so if you have any questions please ask.

Tuesday, 12 February 2013

Alignment, Concurrency and Torture (x86)

Summary: Exploring the effects of unaligned/aligned concurrent access in Java for off/on heap memory, how to test for correctness, torture and cushions.

Concurrent access to memory is a pain at the best of times. The JMM offers Java programmers a way to reason about concurrency in a platform independent manner, and while not without fault it's certainly better then having no memory model at all. Other languages are catching up of course, but the focus of this article is not comparative concurrency, so look it up yourselves ;).
When we go off heap via Unsafe or Direct Byte Buffers, or when we utilize Unsafe to gain direct access to on heap memory we are on our own. Direct memory access strips away the JVM guarantees and leaves you exposed to the underlying OS/architecture memory access rules and the interpretation of the JVM runtime.
Should you be worried?
Silly question.
If you are not the worried type, follow this link with my blessings.
Worried? Here's 2 guarantees now gone to worry about:

Atomicity - memory access on the JVM is guaranteed to be atomic, even when the underlying system does not support it. No such similar guarantees are available for ByteBuffers. In fact ByteBuffers are explicitly not thread safe, so atomicity is not guaranteed at all. Unsafe has no guarantees for anything, so caution is advised.
Word tearing - this means concurrent updates to adjacent bytes in memory do not result in corruption of state. Some processors perform memory writes in units of a word, which can be 2 or 4 bytes. This can result in the unmolested parts of the word overwriting bytes written to by other threads. While the JLS is binding for 'plain' memory access (via assignment) the guarantee does not necessarily extend to off heap memory.

The problem with verifying these behaviours is that concurrency guarantees and issues are by definition timing dependent. This means that while most of the time there is no issue, that is not to say that throwing a few extra processors in, or changing memory layout, or maybe just waiting a bit longer will not trigger the issue...
So how do you prove your code is thread safe once you stepped off heap and into the wild? I recently answered a question on stack overflow regarding testing for correctness under multi-threaded access of a data structure. My answer was accepted, so obviously it's correct, I cut out the niceties to keep this short:

1. Formally prove the correctness of your program in the terms of the JMM.

2. Construct test cases which demonstrate intended above behaviour by using count down latches or other means of suspending threads of execution at specific points in your program to force contention.

3. Statistically demonstrate correctness by exercising the code over a sufficient period of time from multiple threads.

This is all fine for when you write 'safe' Java, but option 1, and therefore 2 go out the window when you are straying off spec. This is an important observation to make: if the rules are not well defined then testing the edge cases for those rules is not that meaningful.
Which leaves us with number 3.

Science is torture

It is a fact well known to those who know it well that a scientific experiment can only prove your theory is either definitely wrong or so far right. This (limited) power of induction from past experiences is what empiricism is all about. To prove JVM implementations implement the JMM correctly Mr. Shipilev (Benchmarking Tzar) has put together the Java Concurrency Torture Suite into which he drags JVMs and tries to break them on any number of architectures (not to worry, he tries it on cute bunnies first). It's a great project for anyone trying to reason about the JMM as it gives you examples in code of edge cases and the corresponding range of expected behaviours on all architectures. It also means that if you got questionable behaviours you want to explore on a particular architecture Aleksey just saved you allot of work.
The JCTS project is a very professional piece of work, so I'll try not to mince words over what you can read in the readme. The general approach is that if you leave it (the concurrent behaviour) long enough the edge cases will come crawling out of the woodwork. Here's one bug found using it, the discussion in the ticket is quite interesting.
The tests run also report how many times each state was detected, which has the added benefit of educating us on the distribution of occurrences for the different cases. The next version may support ASCII bar charts too. Instead of yapping on about it, lets see how atomicity is tackled and all will become clear.

Biggles! Fetch...THE CUSHIONS!

The JCTS already has atomicity tests, I just had to my own to target unaligned access. I added 2 tests, one for unaligned access within the line and one for crossing the cache line, for brevity this is just the long test:
The method for allocating aligned direct memory byte buffers is described in this post. The cross cache line case is very similar with the position being set intentionally to cause cache straddled access.
I expected the straddled line access to be non-atomic and for once I was right(read on, it doesn't last)!!!
Note the fine printed output, a joy! The test run several times and we can see how in the majority of cases things are fine, the observed result is either 0 or -1. But there are other values which should not be there (when our mother is not), those observed values are the result of the lack of atomicity. Reading the value in some undefined trashed state gives us... mmm... trash. As I suspected! Patting myself on the shoulder I moved on to the unaligned non-straddling tests.
The results suggested that unaligned access inside the cache line is atomic on my machine. I admit it was not what I thought. This is a bit surprising. Good old Intel, they can really make this work!

Our chief weapon is surprise

Going through these experiments and results reminded me of a comment made by Gil Tene (Azul CTO) on Martin Thompson's blog:

Cache alignment is only a performance issue. Never a correctness issue. Cache lines do not add or remove atomicity. Specifically, non atomic updates within a single cache line can still be seen non-atomically by other threads.The Java spec only guarantees atomic treatment (as in the bits in the field are read and written all-or-nothing) for boolean, byte, char, short, int, and float field types. The larger types (long and double) *may* be read and written using two separate operations (although most JVMs will still perform these as atomic, all-or-nothing operations).
BTW, all x86 variants DO support both unaligned data types in memory, as well as LOCK operations on such types. This means that on an x86, a LOCK operations that spans boundary between two cache lines will still be atomic (this little bit of historical compatibility probably has modern x86 implementors cursing each time).

I read the above to suggest Gil was thinking atomicity is a given across the cache line, so I sent him an e-mail to discuss my findings. Now Gil has been under the JVM hood more than most people, and this turned out to be quite enlightening:

I'm not saying that cross cache line boundaries cannot induce non-atomicity to occur more often. I *am* saying that whenever you see non-atomicity on accesses that cross cache line boundaries, that same non-atomicity potential was there [for the same code] even within a single cache line, and is usually there regardless of whether or not the in-same-cache-line access is aligned to the type size boundary.

He went on to point out that the whole experimentation as means to proving correctness is not possible (I told you it's well known) and that where the spec. doesn't promise you anything, expect nothing:

... there is no way to verify atomicity by experimentation. You can only disprove it, but cannot prove it.

The only way you can control atomicity on a known x86 arch. for off heap access would be to control ALL x86 instructions used to access your data, where knowing exactly which instructions are used and what the layout restrictions you impose would allow you to prove atomicity through known architectural qualities.

I didn't expect a kind of Spanish Inquisition

Mmmm... but how would I know? Given there are no guarantees on the one hand and the limitations of experimentation on the other, we are back to square one... And while musing on Gil's wise words I ran the unaligned access test again, and what do you know:

Bugger. Bugger. Bugger. It seems like a special value sneaks in every once in a while. This happens quite infrequently and as things worked out it failed to happen until my little exchange with Gil was over, leading me to believe Gil possesses that power of the experienced to have reality demonstrate for them on cue. On the other hand, now I know it doesn't work. Certainty at last.
What about aligned memory access? I gave it a good run and saw no ill effects, but given Gil's warnings this may not be enough. The JCTS already had a test for aligned access and there are no particular expectations there. This is important: it is OK for a JVM implementation to give you no atomicity on putLong/Int/... (via Unsafe or DirectByteBuffer). Aligned access on x86 is atomic, but as Gil points out we don't have full control on every level so that may not be a strong enough guarantee...

Conclusion?

The take away from the above is that offheap storage should be treated as if atomicity is not guaranteed, unless read/write alignment is properly handled. Is that the end of the world? certainly not, but it does mean precautions must be taken if data is to be written and read concurrently. Note that mutating shared data is always a risk, but you may go further without noticing issues with a POJO approach. Also note that some of the tools used for proper object publication are not available, in particular final fields.

Many thanks to Aleksey Shipilev and Gil Tene for their feedback and guidance along the way. I leave word tearing as an exercise to the interested reader.

Update 14/02/2013:
Martin Thompson got back to me asking if the same atomicity issue is still there for putOrdered*/put*Volatile when the cache line is crossed. I expected it would be (so did he), wrote some more tests and it turned out we were both correct. It makes no difference if you use put*/putOrdered*/put*Volatile/compareAndSwap* [correction: CAS is atomic see update 23/09/2013 below], if you cross the cache line there is no atomicity. Similarly I would expect unaligned access within the cache line to lack atomicity, but have not had time to add the tests. Code is on GitHub for your pleasure.
Thanks Martin.

Update 24/07/2013:
JCT has gone into the OpenJDK toolbox and is now public again as JCStress (I suggested JangBang, but Shipilev wouldn't listen :-(). I 'forked' it and added my tests to the new project, it's all available here. The tests are under the org.openjdk.jcstress.tests.unsafe.unaligned package.

Update 23/09/2013:
I've written a further post on JCStress here. And a further clarification on the atomicity of unaligned writes using different methods here. The bottom line being that the only method to perform guaranteed atomic unaligned writes is by using CAS. Volatile reads are not atomic however, so this method is of limited use.

Sunday, 6 January 2013

Direct Memory Alignment in Java

Summary: First in a quick(hopefully) series of posts on memory alignment in Java. This post introduces memory alignment, shows how to get memory aligned blocks, and offers an experiment and some results concerning unaligned memory performance.

Since the first days of Java, one of the things you normally didn't need to worry about was memory. This was a good thing for all involved who cared little if some extra memory was used, where it was allocated and when will it be collected. It's one of the great things about Java really. On occasion however we do require a block of memory. A common cause for that is for doing IO, but more recently it has been used as a means to having off-heap memory which allows some great performance benefits to the right use case. Even more recently there has been talk of bringing back C-style structs to Java, to give programmers better control over memory layout. Having stayed away from direct memory manipulation for long(or having never had to deal with it) you may want to have a read of the most excellent series of articles "What Every Programmer Should Know About Memory" by Ulrich Drepper. I'll try and cover the required material as I go along.

Getting it straight

Memory alignment is mostly invisible to us when using Java, so you can be forgiven for scratching you head. Going back to basic definitions here:

"A memory address a, is said to be n-byte aligned when n is a power of two and a is a multiple of n bytes.

A memory access is said to be aligned when the datum being accessed is n bytes long and the datum address is n-byte aligned. When a memory access is not aligned, it is said to be misaligned. Note that by definition byte memory accesses are always aligned."

In other words for n which is a power of 2:

boolean nAligned = (address%n) == 0;

Memory alignments of consequence are:

Type alignment(1,2,4,8) - Certain CPUs require aligned access to memory and as such would get upset(atomicity loss) if you attempt misaligned access to your data. E.g. if you try to MOV a long to/from an address which is not 8-aligned.
Cache line alignment(normally 64, can be 32/128/other) - The cache line is the atomic unit of transport between the main memory and the different CPU caches. As such packing relevant data to that alignment and to that punctuation can make a difference to your memory access performance. Other considerations are required here due to the rules of interaction between the CPU and the cache lines(false sharing, word tearing, atomicity)
Page size alignment(normally 4096, can be configured by OS) - A page is the smallest chunk of memory transferable to an IO device. As such aligning your data to page size, and using the page size as your allocation unit can make a difference to interactions when writing to disk/network devices.

Memory allocated in the form of objects (anything produced by 'new') and primitives is always type aligned and I will not be looking much into it. If you want to know more about what the Java compiler make of you objects, this excellent blog post should help.
As the title suggests I am more concerned with direct memory alignment. There are several ways to get direct memory access in Java, they boil down to 2 methods:

JNI libraries managing memory --> not worried about those for the moment
Direct/MappedByteBuffer/Unsafe which are all the same thing really. The Unsafe class is at the bottom of this gang and is the key to official and semi-official access to memory in Java.

The memory allocated via Unsafe.allocatememory will be 8 byte aligned. Memory acquired via ByteBuffer.allocateDirect will ultimately go through the Unsafe.allocatememory but will differ in 2 important ways:

ByteBuffer memory will be zero-ed out.
Up until JDK6 direct byte buffers were page aligned at the cost of allocating an extra page of memory for every byte buffer allocated. If you allocated lots of direct byte buffers this got pretty expensive. People complained and as of JDK7 this is no longer the case. This means that direct ByteBuffer are now only 8 byte aligned.
Less crucial, but convenient: memory allocated via ByteBuffer.allocateDirect is freed as part of the ByteBuffer object GC. This mechanism is only half comforting as it depends on the finalizer being called[UPDATE: things are a tad more complex than this, but this is not the topic of this post... DirectByteBuffer cleanup is done by a sun.misc.Cleaner, which is a PhatomRefeference to the buffer and will trigger deallocation when the PhantomReference is processed, see the code for more details], and that is not much to depend on.

Allocating an aligned ByteBuffer

No point beating about the bush, you'll find it's very similar to how page alignment works/used to work for ByteBuffers, here's the code:

To summarize the above, you allocate as much extra memory as the required alignment to ensure you can position a large enough memory slice in the middle of it.

Comparing Aligned/Unaligned access performance

Now that we can have an aligned block of memory to play with, lets test drive some theories. I've used Caliper as the benchmark framework for this post (it's great, I'll write a separate post on my experience/learning curve) [UPDATE: This post is from the dark pre-JMH days, these days you should NOT use Caliper, use JMH instead] and intend to convert previous experiments to use it when I have the time. The theories I wanted to test out in this post are:

Theorem I: Type aligned access provides better performance than unaligned access.
Theorem II: Memory access that spans 2 cache lines has far worse performance than aligned mid-cache line access.
Theorem III: Cache line access performance changes based on cache line location.

To test the above theorems I put together a small experiment/benchmark, with an algorithm as follows:

Allocate a page aligned memory buffer of set sizes: 1,2,4,8,16,32,64,128.256,512,1024,2048 pages. Allocation time is not measured but done up front. The different sizes will highlight performance variance which stems from the cache hierarchy of sizes.
We will write a single long per cache line (iterate through the whole memory block), then go back and read the written values (iterate through the whole memory block). There are usually 64 cache lines in a page (4096b/64b case), but we can only write 63 longs which will 'straddle' 2 lines, so the non-straddled offset will skip the first line to have the same number of operations.
Repeat step 2 (2048/(size in pages)) times. The point is to do the same number of memory access operations per experiment.

Here is the code for the experiment:

Before moving on to the results please note that this test is very likely to yield different results on different hardware, so your mileage is very likely to vary. If you find the time to run the test on your own hardware I'd be very curious to compare notes. I used my laptop, which has an i5-2540M processor(Sandy Bridge, dual core, hyper threaded, 2.6Ghz). The L1 Cache is 32K, L2 is 256K and L3 is 3M. This is important in the context of the changes in behaviour we would expect to see as the data set exceeds the capacity of each cache. Should you run this experiment on a machine with different cache allocations, expect your results to vary in accordance.
This experiment is demonstrating a very simple use case(read/write long). There is no attempt to enjoy the many features the CPU offers which may indeed stronger alignment than simple reading and writing. As such take the results to be limited in scope to this use case alone.
Finally, note that the charts reflect relative performance to offset 0 memory access. As the performance varied also as a function of memory buffer size I felt this helps highlight the difference in behaviour which is related to offset rather than size. The effect of the size of the memory buffer is reflected in this chart, showing time as function of size for offset 0:

The chart shows how the performance is determined by the cache the memory buffer fits into. The caches size in pages is 8 for L1, 64 for L2, and 768 for L3 which correspond to the steps(200us, 400us, 600us) we see above, with a final step into main memory(1500us). While not the topic at hand it is worth noting the effect a reduced data set size can have on your application performance, and by extension the importance of realistic data sets in benchmarking.

Theorem I: Type aligned access provides better performance than unaligned access - NOT ESTABLISHED (for this particular use case)

To test Theorem I I set offset to values 0 to 7. Had a whirl and to my surprise found it made no bloody difference! I double and triple checked and still no difference. Here's a chart reflecting the relative cost of memory access:

What seems odd is the difference in performance for the first 4 byte, which are significantly less performant then the rest for memory buffer sizes which fit in the L1 cache(32k = 4k * 8 = 8 pages). The rest show relatively little variation and for the larger sizes there is hardly any difference at all. I have been unable to verify the source of the performance difference...
What is this about? Why would Mr. Drepper and many others on the good web lie to me? As it turns out Intel in their wisdom have sorted this little nugget out for us recently with Nehalem generation of CPUs, such that unaligned type access has no performance impact. This may not be the case on other non-Intel CPUs, or older Intel CPU. I ran the same experiment on my i5 laptop and my Core2 desktop, but could not see a significant difference. I have to confess I find this result a bit alarming given how it flies in the face of accepted good practice. A more careful read of Intel's Performance Optimization Guide reveals the recommendation to align your type stands, but it is not clear what adverse effects it may have for the simple use case we have here. I will attempt to construct a benchmark aimed at uncovering those issues in a further post perhaps. For now it may comfort you to know that breaking the alignment rule seems to not be the big issue it once was on x86 processors. Note that behaviour on non-x86 processors may be wildly different.

Theorem II: Memory access that spans 2 cache lines has far worse performance than aligned mid-cache line access - CONFIRMED

To test Theorem II I set the offset to 60. The result was a significant increase in the run time of the experiment. The difference was far more pronounced on the Core2 (x3-6 the cost of normal access) then on the i5 (roughly x1.5 the cost of normal access), and we can look forward to it being less significant as time goes by. This result should be of interest to anybody using direct memory access as the penalty is quite pronounced. Here's a graph comparing offset 0,32 and 60:

The other side effect of cache line straddling is loss of update atomicity, which is a concern for anyone attempting concurrent direct access to memory. I will go back to the concurrent aspects of aligned/unaligned access in a later post.

Theorem III: Cache line access performance changes based on cache line location - NOT ESTABLISHED

This one is based on Mr. Dreppers article:

3.5.2 Critical Word Load

Memory is transferred from the main memory into the caches in blocks which are smaller than the cache line size. Today 64 bits are transferred at once and the cache line size is 64 or 128 bytes. This means 8 or 16 transfers per cache line are needed.
The DRAM chips can transfer those 64-bit blocks in burst mode. This can fill the cache line without any further commands from the memory controller and the possibly associated delays. If the processor prefetches cache lines this is probably the best way to operate.
If a program's cache access of the data or instruction caches misses (that means, it is a compulsory cache miss, because the data is used for the first time, or a capacity cache miss, because the limited cache size requires eviction of the cache line) the situation is different. The word inside the cache line which is required for the program to continue might not be the first word in the cache line. Even in burst mode and with double data rate transfer the individual 64-bit blocks arrive at noticeably different times. Each block arrives 4 CPU cycles or more later than the previous one. If the word the program needs to continue is the eighth of the cache line, the program has to wait an additional 30 cycles or more after the first word arrives

To test Theorem III I set the offset values to the different type aligned locations on a cache line for a long: 0,8,16,24,32,40,48,56. My expectation here was to see slightly better performance for the 0 location as explained in Mr. Drepper's paper, with degrading performance as you get further from the begining of the cache line, at least for the large buffer sizes where data is always fetched from main memory. Here's the result:

The results surprised me. As long as the buffer fit in my L1 cache, writing/reading from the first 4 bytes was significantly more expensive than elsewhere on the line, as discussed in Theorem I, but the other locations did not have measurable differences in cost. I pinged Martin Thompson which pointed me at the Critical Word First feature/optimization explained here:

Critical word first takes advantage of the fact that the processor probably only wants a single word by signaling that it wants the missed word to appear first in the block. We receive the first word in the block (the one we actually need) and pass it on immediately to the CPU which continues execution. Again, the block continues to fill up in the "background".

I was unable to determine when exactly Intel introduced the feature to their CPU's, and I'm not sure which processors fail to supply it.

Conclusions/Takeaways

Of the 3 concerns stated, only cache line straddling proved to be a significant performance consideration. Cache line straddling impact is diminishing in later Intel processors, but is unlikely to disappear altogether. As such it should feature in the design of any direct memory access implementation and may be of interest to serializing/de-serializing code.
The other significant factor is memory buffer size, which is obvious when one considers the cache hierarchy. This should prompt us to make an effort towards more compact memory structures. In light of the negligible cost of type mis-alignment we may want to skip type alignment driven padding of structures when implementing off-heap memory stores.
Finally, this made me realize just how fast moving is the pace of development in hardware which invalidates past conventional wisdom/assumptions. So yet another piece of evidence validating the call to 'Measure, Measure, Measure' all things performance.

Running the benchmarks yourself

Code is on GitHub
Run 'ant run-unaligned-experiments'
Wait... it takes a while to finish
Send me the output?

To get the most deterministic results you should run the experiments pinned to a core which runs nothing else, that would reduce noise from other processes 'polluting' the cache.

Many thanks to Martin, Aleksey Shipilev and Darach Ennis for proof reading and providing feedback on initial drafts, any errors found are completely their fault :-).

[UPADTE 22/06/2015: Experiments were repeated in JMH form, arriving at similar conclusions with great insight and cross platform testing by Aleksey Shipilev.]