Psychosomatic, Lobotomy, Saw: Memory layout

Showing posts with label Memory layout. Show all posts

Thursday, 26 June 2014

Notes on False Sharing

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
I've recently given a talk on queues and related optimisations, but due to the limitations of time, the Universe and Everything else had to cut short the deep dive into false sharing. Here however I'm not limited by such worldly concerns, so here is more detail for those who can stomach it.

What mean False Sharing?

Described before on this blog and in many other places (Intel, usenix, Mr. T), but assuming you are not already familiar with the issue:

"false sharing is a performance-degrading usage pattern that can arise in systems with distributed, coherent caches at the size of the smallest resource block managed by the caching mechanism. When a system participant attempts to periodically access data that will never be altered by another party, but that data shares a cache block with data that is altered, the caching protocol may force the first participant to reload the whole unit despite a lack of logical necessity. The caching system is unaware of activity within this block and forces the first participant to bear the caching system overhead required by true shared access of a resource." - From Wikipedia

Makes sense, right?
I tried my hand previously at explaining the phenomena in terms of the underlying coherency protocol which boils down to the same issue. These explanations often leave people still confused and I spent some time trying to reduce the explanation so that the idea can get across in a presentation... I'm pretty sure it still left people confused.
Let's try again.
The simplest explanation I've come up with to date is:

Memory is cached at a granularity of a cache line (assumed 64 bytes, which is typical, in the following examples but may be a smaller or larger power of 2, determined by the hardware you run on)
Both reading and writing to a memory location from a particular CPU require the presence of a copy of the memory location in the reading/writing CPU cache. The required location is 'viewed' or cached at the granularity of a cache line.
When a line is not in a threads cache we experience a cache miss as we go searching for a valid copy of it.
Once a line is in our cache it may get evicted (when more memory is required) or invalidated (because a value in that line was changed by another CPU). Otherwise it just stays there.
A write to a cache line invalidates ALL cached copies of that line for ALL CPUs except the writer.

I think there is a mistaken intuition most of us have when it comes to memory access, which is that we think of it as direct. In reality however you access memory through the CPU cache hierarchy and it is in the attempt to make this system coherent and performant that the hardware undoes our intuition. Rather than thinking of all CPUs accessing a global memory pool we should have a mental model that is closer to a distributed version control system as illustrated in this excellent post on memory barriers.
Given the above setting of the stage we can say that false sharing occurs when Thread 1 invalidates (i.e writes to) a cache line required by Thread2 (for reading/writing) even though Thread2 and Thread1 are not accessing the same location.
The cost of false sharing is therefore observable as cache misses, and the cause of false sharing is the write to the cache line (coupled with the granularity of cache coherency). Ergo, the symptom of false sharing can be observed by looking at hardware events for your process or thread:

If you are on linux you can use perf. (Checkout the awesome perf integration in JMH!)
On linux/windows/mac Intel machines there is PCM.
From Java you can use Overseer
... look for hardware counters on google.

Now assuming you have established that the number of cache misses you experience is excessive, the challenge remains in finding the sources of these cache misses. Sadly for Java there is no free utility that I'm aware of which can easily point you to the source of your false sharing problem, C/C++ developers are spoilt for choice by comparison. Some usual suspects to consider however are any shared data constructs, for example queues.

Active False Sharing: Busy Counters

Lets start with a queue with 2 fields which share the same cache line where each field is being updated by a separate thread. In this example we have an SPSC queue with the following structure:
Ignoring the actual correct implementation and edge cases to make this a working queue (you can read more about how that part works here), we can see that offer will be called by one thread, and poll by another as this is an SPSC queue for delivering messages between 2 threads. Each thread will update a distinct field, but as the fields are all jammed tight together we know False Sharing is bound to follow.
The solution at the time of writing seems to be the introduction of padding by means of class inheritance which is covered in detail here. This will add up to an unattractive class as follows:
Yipppeee! the counters are no longer false sharing! That little Pad in the middle will make them feel all fresh and secure.

Passive False Sharing: Hot Neighbours

But the journey is far from over (stay strong), because in False Sharing you don't have to write to a shared cache line to suffer. In the above example both the producer and the consumer are reading the reference to buffer which is sharing a cache line with naughty Mr. consumerIndex. Every time we update consumerIndex an angel loses her wings! As if the wings business wasn't enough, the write also invalidates the cache line for the producer which is trying to read the buffer so that it can write to the damn thing. The solution?
MORE PADDING!!
Surely now everyone will just get along?

The Object Next Door

So now we have the consumer and the producer feverishly updating their respective indexes and we know the indexes are not interfering with each other or with buffer, but what about the objects allocated before/after our queue object? Imagine we allocate 2 instances of the queue above and they end up next to each other in memory, we'd be practically back to where we started with the consumerIndex on the tail end of one object false sharing with the producer index at the head end of the other. In fact, the conclusion we are approaching here is that rapidly mutated fields make very poor neighbours altogether, be it to other fields in the same class or to neighbouring classes. The solution?

EVEN MORE PADDING!!!

Seems a tad extreme don't it? But hell, at least now we're safe, right?

Buffers are people too

So we have our queue padded like a new born on it's first trip out the house in winter, but what about Miss buffer? Arrays in Java are objects and as such the buffer field merely point to another object and is not inlined into the queue (as it can be in C/C++ for instance). This means the buffer is also subject to potentially causing or suffering from false sharing. This is particularly true for the length field on the array which will get invalidated by writes into the first few elements of the queue, but equally true for any object allocated before or after the array. For circular array queues we know the producer will go around writing to all the elements and come back to the beginning, so the middle elements will be naturally padded. If the array is small enough and the message passing rate is high enough this can have the same effect as any hot field. Alternatively we might experience an uneven behaviour for the queue as the elements around the edges of the array suffer false sharing while the middle ones don't.

Since we cannot extend the array object and pad it to our needs we can over-allocate the buffer and never use the initial/last slots:

We cannot prevent false sharing between neighbours and the array length, what we can do is hoist it into our own class field and use our copy instead of the buffer field. This can also result in better locality if the hoisted length value is hosted in a field alongside the array itself.
Note that this is costing us some computational overhead to compute the offset index instead of the natural one, but in practice we can implement queues such that the we achieve the same effect with no overhead at all.

Protection against the Elements

While on this topic (I do go on about it... hmmm) we can observe that the producer and the consumer threads can still experience false sharing on the cache line holding adjacent elements in the buffer array. This is something we can perhaps handle more effectively if we had arrays of structs in Java (see the value types JEP and structured arrays proposal), and if the queue was to be a queue of such structs. Reality being the terrible place that it is, this is not the case yet. If we could prove our application indeed suffered from this issue, could we solve it?
Yes, but at a questionable price...
If we allocate extra elements for each reference we mean to store in the queue we can use empty elements to pad the used elements. This will reduce the density of each cache line and as a result the probability of false sharing as less and less elements are in a cache line. This comes a high cost however as we multiply the size of the buffer to make room for the empty slots and actively sabotage the memory throughput as we have less data per read cache line. This is an optimization I am reluctant to straight out recommend as a result, but what the hell? sometimes it might helps.

Playing with Dirty Cards

This last one is a nugget and a half. Usually when people say: "It's a JVM/compiler/OS issue" it turns out that it's a code issue written by those same people. But sometimes it is indeed the machine.

In certain cases you might not be the cause of False Sharing. In some cases you might be experiencing False Sharing induced by card marking. I won't spoil it for you, follow the link.

Summary

So there you have it, 6 different cases of False Sharing encountered in the process of optimizing/proofing the piddly SPSC queue. Each one had an impact, some more easily measured than others. The take away here is not "Pad every object and field to death" as that will be detrimental in most cases, much like making every field volatile. But this is an issue worth keeping in mind when writing shared data structures, and in particular when considering highly contended ones.
We've discussed 2 types of False Sharing:

Active: where both threads update distinct locations on the same cache line. This will severely impact both threads as they both experience cache misses and cause repeated invalidation of the cache line.
Passive: where one thread writes to and another reads from distinct locations on the same cache line. This will have a major impact on the reading thread with many reads resulting in a cache miss. This will also have an impact on the writer as the cache line sharing overhead is experienced.

These were identified and mitigated in several forms:

Neighbouring hot/hot fields (Active)
Neighbouring hot/cold fields (Passive)
Neighbouring objects (potential for both Active and Passive)
Objects neighbouring an array and the array elements (same as 3, but worth highlighting as a subtlety that arrays are separate objects)
Array elements and array length field (Passive as the length is never mutated)
Distinct elements sharing the same cache line (Passive in above example, but normally Active as consumer nulls out the reference on poll)
-XX:+UseCondCardMark - see Dave Dice article

What should you do?

Consider fields concurrently accessed by different threads, in particular frequently written to fields and their neighbours.
Consider Active AND Passive false sharing.
Consider (and be considerate to) your neighbours.
All padding comes at the cost of using up valuable cache space for isolation. If over used (like in the elements padding) performance can suffer.

And finally: love thy neighbour, just a little bit and with adequate protection and appropriate padding, but still love'em.

Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium!

Friday, 28 March 2014

Java Object Layout: A Tale Of Confusion

Buy the T-shirt

Following the twists and turns of the conversation on this thread in the Mechanical Sympathy mailing list highlights how hard it is to reason about object layout based on remembered rules. Mostly every person on the thread is right, but not completely right. Here's how it went...

Context

The thread discusses False Sharing as described here. It is pointed out that the padding is one sided and padding using inheritance is demonstrated as solution. The merits of using inheritance vs. using an array and utilizing Unsafe to access middle element (see Disruptor's Sequence) vs. using AtomicLongArray to achieve the same effect are discussed (I think the inheritance option is best, as explored here). And then confusion erupts...

What's my layout?

At this point Peter L. makes the following point:

[...]in fact the best option may be.
class Client1 {
private long value;
public long[] padding = new long[5];
}

What follows is a series of suggestions on what the layout of this class may be.

Option 1: My recollections

I was too lazy to check and from memory penned the following:

[...] the ordering of the fields may end up shifting the reference (if it's 32bit or CompressedOop) next to the header to fill the pad required for the value field. The end layout may well be:
12b header
4b padding(oop)
8b value

Option 2: Simon's recollections

Simon B. replied:

[...] I thought Hotspot was laying out longs before
references, and that the object header was 8 Bytes.
So I would expect Client1 to be laid out in this way:
8B header
8B value
4B padding (oop)
[...] Am I out of date in object layout and header size ?

Option 3: TM Jee's doubts

Mr. Jee slightly changed the class and continued:

for:
class Client1 {
private long value;
public long[] padding = new long[5]
public Object[] o = new Object[1];
}
the memory layout should be something like
12b header (or is it 16b)
8b value
4b for the long[] (its just the reference which is 4b for compressed and 8b if not)
4b for the Object[] (again it's just the reference)
Is this right so far?

To which Peter L. wisely replied:

Yes. But as you recognise the sizes of the header and sizes of references are not known until runtime.

Option 4: Check...

So I used JOL to check. And as it turns out we are all somewhat right and somewhat wrong...
I'm right for compressed oops (the default for 64bit):
Running 64-bit HotSpot VM.
Using compressed references with 3-bit shift.
Client1 object internals:
OFFSET SIZE TYPE DESCRIPTION
0 4 (object header)
4 4 (object header)
8 4 (object header)
12 4 long[] Client1.padding
16 8 long Client1.value
24 4 Object[] Client1.o
28 4 (loss due to the next object alignment)

The header is 12b and the array reference is shifted up to save on space. But my casual assumption 32bit JVM layout will be the same is wrong.

Simon is right that the header is 8b (but only for 32bit JVMs) and that references will go at the end (for both 32bit and 64bit, but not with compressed oops):
Running 32-bit HotSpot VM.
Client1 object internals:
OFFSET SIZE TYPE DESCRIPTION
0 4 (object header)
4 4 (object header)
8 8 long Client1.value
16 4 long[] Client1.padding
20 4 Object[] Client1.o

And finally with 64bit Mr. Jee is right too:
Running 64-bit HotSpot VM.
Client1 object internals:
OFFSET SIZE TYPE DESCRIPTION
0 4 (object header)
4 4 (object header)
8 4 (object header)
12 4 (object header)
16 8 long Client1.value
24 8 long[] Client1.padding
32 8 Object[] Client1.o

And Peter is entirely right to point out the runtime is the crucial variable in this equation.

Lesson?

If you catch yourself wondering about object layout:

Use JOL to check, it's better than memorizing rules
Remember that 32/64/64+Oops are different for Hotspot, and other JVMs may have different layouts altogether
Read another post about java memory layout

Monday, 7 October 2013

SPSC revisited part III - FastFlow + Sparse Data

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
After I published the last post on this topic (if you want to catchup on the whole series see references below) I got an interesting comment from Pressenna Sockalingasamy who is busy porting the FastFlow library to Java. He shared his current implementation for porting the FastFlow SPSC queue (called FFBuffer) and pointed me at a great paper on it's implementation. The results Pressenna was quoting were very impressive, but when I had a close look at the code I found what I believed to be a concurrency issue, namely there were no memory barriers employed in the Java port. Our exchange is in the comments of the previous post. My thoughts on the significance of memory barriers, and why you can't just not put them in are recorded here.
Apart from omitting the memory barriers the FFBuffer offered two very interesting optimizations:

Simplified offer/poll logic by using the data in the queue as an indicator. This removes the need for tracking the relative head/tail position and using those to detect full/empty conditions.
Sparse use of the queue to reduce contention when the queue is near full or near empty (see musings below). This bloats the queue, but guarantees there are less elements per cache line (this one is an improvement put in by Pressenna himself and not part of the original)

So after getting off my "Memory Barriers are important" high horse, I stopped yapping and got to work seeing what I can squeeze from these new found toys.

FFBuffer - Simplified Logic

The simplified logic is really quite compelling. If you take the time to read the paper, it describes the queue with offer/poll loop I've been using until now as Lamport's circular buffer (the original paper by Lamport is here, not an easy read), full code is here, but the gist is:

I've been using a modified version which I got from a set of examples put forth by Martin Thompson as part of his lock free algos presentation and evolved further as previously discussed(padding is omitted for brevity):

With the FFBuffer variant Pressenna sent my way we no longer need to do all that accounting around the wrap point (padding is omitted for brevity):

The new logic is simpler, has less moving parts (no head/tail caching) and in theory looks like cache coherency traffic should be reduced by only bringing in the array elements which are required in any case.

The Near Empty/Full Queue Problem

At some point while testing the different queues I figured the test harness itself may be to blame for the run to run variance I was seeing. The fun and games I had are a story to be told some other time, but one of the thing that emerged from these variations on the test harness was the conclusion that testing a near empty queue produces very different results. By reversing the thread relationship (Main is producer, other thread is consumer, code here) in the test harness I could see great performance gains, the difference being that the producer was getting a head start in the race, creating more slack to work with. This was happening because the consumer and the producer are experiencing contention and false sharing when the queue has 1 to 16 elements in it. The fact that the consumer is reading a cache row that is being changed is bad enough, but it is also changing the same line by setting the old element to null. The nulling out of elements as they are read is required to avoid memory leaks. Also note that when the consumer/producer run out of space to read/write as indicated by their cached view of the tail/head they will most likely experience a read miss with the implications discussed previously here.
Several solutions present themselves:

Batch publication: This is a problem for latency sensitive applications and would require the introduction of some timely flush mechanism to the producer. This is not an issue for throughput focused applications. Implementing this solution will involve the introduction of some batch array on the producer side which is 'flushed' into the queue when full. If the batch array is longer than a cache line the sharing issue is resolved. An alternative solution is to maintain a constant distance between producer and consumer by changing the expectation on the consumer side to stop reading at a given distance.
Consumer batch nulling: Delay the nulling out of read elements, this would reduce the contention but could also result in a memory leak if some elements are never nulled out. This also means the near empty queue is still subject to a high level of coherency traffic as the line mutates under the feet of the reader.
Padding the used elements: We've taken this approach thus far to avoid contention around fields, we can apply the same within the data buffer itself. This will require us to use only one in N cells in the array. We will be trading space for contention, again. This is the approach taken above and it proves to work well.

Good fences make good neighbours

As mentioned earlier, the code above has no memory barriers, which made it very fast but also incorrect in a couple of rather important ways. Having no memory barriers in place means this queue can:

Publish half constructed objects, which is generally not what people expect from a queue. In a way it means that notionally stuff can appear on the other side of the queue before you sent it (because the code is free to be reordered by the compiler).
The code can be reordered such that the consumer ends up in an infinite busy loop (the read value from the array can be hoisted out of the reading loop).

I introduced the memory barriers by making the array read volatile and the array write lazy. Note that in the C variation only a Write Memory Barrier is used, I have implemented variations to that effect but while testing those Pressenna informed me they hang when reading/writing in a busy loop instead of yielding (more on this later). Here's the final version (padding is omitted for brevity):

There are a few subtleties explored here:

I'm using Unsafe for array access, this is nothing new and is a cut and paste job from the AtomicReferenceArray JDK class. This means I've opted out of the array bound checks we get when we use arrays in Java, but in this case it's fine since the ring buffer wrapping logic already assures correctness there. This is necessary in order to gain access to getVolatile/putOrdered.
I switched Pressanas original field padding method with mine, mostly for consistency but also it's a more stable method of padding fields (read more on memory layout here).
I doubled the padding to protect against pre-fetch caused false sharing (padding is omitted above, but have a look at the code).
I replaced the POW final field with a constant (ELEMENT_SHIFT). This proved surprisingly significant in this case. Final fields are not optimized as aggressively as constants, partly due to the exploited backdoor in Java allowing the modification of final fields (here's Cliff Click's rant on the topic). ELEMENT_SHIFT is the shift for the sparse data and the shift for elements in the array (inferred from Unsafe.arrayIndexScale(Object[].class)) combined.

How well did this work? Here is a chart comparing it to the the latest queue implementation I had (Y5 with/Y4 without Unsafe array access), with sparse data shift set to 0 and 2 (benchmarked on Ubuntu13.04/JDK7u40/i7-4700MQ@2.40GHz using JVM params: -XX:+UseCondCardMark -XX:CompileThreshold=100000 and pinned to run across cores, results from 30 runs are sorted to make the the charts more readable):

Y=Ops per second, X=benchmark run

Shazzam! This is a damn fine improvement note that the algorithm switch is far less significant than the use of sparse data.
To further examine the source of the gain I benchmarked a variant with the element shift parametrized and a further variant with single padding (same machine and settings as above, sparse shift is 2 for all):

Y=Ops per second, X=benchmark run

Single padding seems to make little difference but parametrization is quite significant.

Applying lessons learnt

Along the way, as I was refining my understanding of the FFBuffer optimizations made above I applied the same to the original implementation I had. Here are the incremental steps:

Y4 - Starting point shown above.
Y5 - Use Unsafe to access array (regression).
Y6 - Use sparse elements in the buffer.
Y7 - Double pad the fields/class/array.
Y8 - Put head/tailCache on same line, put tail/headCache on another line.
Y81 - From Y8, Parametrize sparse shift instead of constant (regression).
Y82 - From Y8, Revert to using array instead of Unsafe access (regression).
Y83 - From Y8, Revert double padding of fields/class/array (regression).

Lets start with the journey from Y4 to Y8:

Y=Ops per second, X=benchmark run

Note the end result is slightly better than the FF one, though slightly less stable. Here is a comparison between Y8 and it's regressions Y81 to Y83:

Y=Ops per second, X=benchmark run

So... what does that tell us?

The initial switch to using Unsafe for array access was a regression as a first step (Y5), but proves a major boost in the final version. This is the Y82 line. The step change in its performance I think is down to the JIT compiler lucking out on array boundary check elimination, but I didn't check the assembly to verify.
Parametrizing is an issue here as well, more so than for the FFBuffer implementation for some reason.
I expected Y83 (single padding) to be the same or worse than Y6. The shape is similar but it is in fact much better. I'm not sure why...
There was a significant hiccup in the field layout of Y4/5/6 resulting in the bump we see in going from Y6 to Y7 and the difference between Y83 and Y6. As of now I'm not entirely certain what is in play here. It's clear that pre-fetch is part of it, but I still feel I'm missing something.

So certainly not all is clear in this race to the top but the end result is a shiny one, let's take a closer look at the differences between Y8 and FFO3. Here's a comparison of these 2 with sparse data set to 0,1,2,3,4 running cross cores:

From these results it seems that apart from peaking for the s=2 value, Y8 is generally slower than FFO3. FFO3 seems to improve up to s=3, then drop again for s=4. To me this suggests FFO3 is less sensitive to the effect of the near empty queue as an algorithm.
Next I compared the relative performance of the 2 queue when running on the same core, this time only s=0, 2 where considered:

This demonstrates that FFO3 is superior when threads are running on the same core and still benefits from sparse data when running in that mode. Y8 is less successful and seem to actually degrade in this scenario when sparse data is used. Y8 offers a further optimization direction I have yet to fully explore in the form of batching consumer reads, I will get back to it in a further post.

On Testing And A Word Of Warning

I've been toying around with several test harnesses which are all on GitHub for you to enjoy. The scripts folder also allows running the tests in different modes (you may want to adjust the taskset values to match your architecture). Please have a play. Mostly the harnesses differ in the way they measure the scope of execution and which thread (producer/consumer) acts as the trigger for starting the test and so on. Importantly you will find there is the BusyQueuePerfTest which removes the Thread.yield() call when an unsuccessful offer/poll happens.
During the long period in which this post has been in the making (I was busy, wrote other posts... I have a day job... it took forever) Pressana has not been idle and got back to me with news that all the implementations which utilize Unsafe to access the array (Y5/SPSCQueue5 and later, most of the FFBuffer implementations) were in fact broken when running the busy spin test and would regularly hang indefinitely. I tested locally and found I could not reproduce the issue at all for the SPSCQueue variations and could only reproduce it for the FFBuffer implementations which used no fences (the original) or only the STORE/STORE barrier (FFO1/2). Between us we have managed to verify that this is not an issue for JDK7u40 which had a great number of concurrency related bug fixes in it, but is an issue for previous versions! I strongly suggest you keep this in mind and test thoroughly if you intend to use this code in production.

Summary

This post has been written over a long period, please forgive any inconsistencies in style. It covers a long period of experimentation and is perhaps not detailed enough... please ask for clarifications in comments and I will do my best to answer. Ultimately, the source is there for you to examine and play with and has the steps broken down such that a diff between step should be helpful.

This is also now part 5 in a series of long running posts, so if you missed any or forgotten it all by now here they are:

Single Producer/Consumer lock free Queue step by step: This is covering the initial move from tradition JDK queues to the lock free variation used as a basis for most of the other posts
135 Million messages a second between processes in pure Java: Taking the queue presented in the first post and taking it offheap and demonstrating its use as an IPC transport via a memory mapped file
Single Producer Single Consumer Queue Revisited: An empiricist tale - part I: Looking at run to run variation in performance in the original implementation and trying to cure it by inlining the counter fields into the queue class
SPSC Revisited - part II: Floating vs Inlined Counters: Here I tried to access to impact of inlining beyond fixing up the initial implementations exposure to false sharing on one side.

And here we are :-), mostly done I think.

I would like to thank Pressana for his feedback, contribution and helpful discussion. He has a website with some of his stuff on, he is awesome! As mentioned above the idea for using sparse data is his own (quoting from an e-mail):

"All I tried was to cache align access to the elements in the array as it is done in plenty of c/c++ implementations. I myself use arrays of structs to ensure cache alignment in C. The same principle is applied here in Java where we don't have structs but only an array of pointer. "

Finally, the end result (FFO3/Y8) is incredibly sensitive to code changes as the difference between using a constant and field demonstrates, but also performs very consistently between runs. As such I feel there is little scope for improvement left there... so maybe now I can get on to some other topics ;-).

Sunday, 15 September 2013

Diving Deeper into Cache Coherency

I recently gave a talk about Mechnical Sympathy at Facebook, which was mostly a look at the topic through the SPSC queue optimization series of posts. During the talk I presented the optimisation step (taken in Martin's original series of queue implementations) of adding a cache field for the head and tail. Moving from this:

To this:

This change yields a great improvement in throughput which I explained to be down to the reduction in coherency traffic between the cores. This is:

Not what I stated previously here.
A very brief and unsatisfying explanation.

I shall try and remedy that in this post.

Penance

In my original step by step post I explained the benefit of the above optimization is the replacement of a volatile read with a plain read. Martin was kind enough to gently correct me in the comments:

"The "volatile read" is not so much the issue. The real issues come from the read of the head or tail from the opposite end will pretty much always result in a cache miss. This is because the other side is constantly changing it."

and I argued back, stating the JIT compiler can hoist the field into a register and wistfully wishing I could come up with an experiment to prove one way or the other. I was mostly wrong and Martin was totally right.
I say mostly and not entirely because eliminating volatile reads is not a bad optimization and is responsible for part of the perf improvement. But it is not the lion share. To prove this point I have hacked the original version to introduce volatile reads (but still plain writes) from the headCache/tailCache fields. I won't bore you with the code, it's using the field offset and Unsafe.getLongVolatile to do it.
I ran the test cross core for 30 times and averaged the summary results. Running on new hardware (so no reference point to prev. quoted results) i7-4700MQ/Ubuntu 13.04/JDK7u25. Here are the results, comparing Original21 (head/tail padded, no cache fields), Original3 (head/tail padded, head/tailCache padded) and VolatileRead (same as Original3, but with volatile read of cache fields):

Original21: 65M ops/sec
Original3: 215M ops/sec
VolatileRead: 198M ops/sec

As we can see, there is no massive difference between Original3 and VolatileRead (especially when considering the difference from Original21), leading me to accept I need to buy Martin a beer, and apologize to any reader of this blog who got the wrong idea reading my post.
Exercise to the reader: I was going to run the above with perf to quantify the impact on cache-misses, but as perf doesn't support this functionality on Haswell I had to let it go. If someone cares to go through the exercise and post a comment it would be most welcome.

What mean cache coherency?

Moving right along, now that we accept the volatile read reduction is not the significant optimization here, let us dig deeper into this cache coherency traffic business. I'll avoid repeating what is well covered elsewhere (like cache-coherence, or MESIF on wikipedia, or this paper, some excellent examples here, and an animation for windows users only here) and summarize:

MESI State Diagram

Modern CPUs employ a set of caches to improve memory access latency.
To present us with a consistent view of the world when concurrent access to memory happens the caches must communicate to give us a coherent state.
There are variations on the protocol used. MESI is the basic one. Recent intels use MESIF. In the examples below I use MESI as the added state in MESIF adds nothing to the use case.
The guarantee they go for is basic:

One change at a time (to a cache line): A line is never in M state in more than one place.
Let me know if I need a new copy: If I have a copy and someone else changed it I need to get me a new copy. (My copy will move from Shared to Invalid)

All of this happens under the hood of your CPU without you needing to lose any sleep, but if we somehow cause massive amounts of cache coherency traffic then our performance will suffer. One note about the state diagram. The observant reader (thanks Darach) will notice the I to E transition for Processor Write(PW) is in red. This is not my diagram, but I think the reason behind it is that the transition is very far from trivial (or a right pain in the arse really) as demonstrated below.

False Sharing (once more)

False sharing is a manifestation of this issue where 2 threads are competing to write to the same cache line which is constantly Invalid in their own cache.

Here's a blow by blow account of false sharing in terms of cache coherency traffic:

Thread1 and Thread2 both have the false Shared cache line in their cache
Thread1 modifies HEAD (state goes from S to E), Thread2's copy is now Invalid
Thread2 wants to modify TAIL but his cache line is now in state I, so experiences a write miss:

Issue a Read With Intent To Modify (RWITM)
RWITM intercepted by Thread1
Thread1 Invalidates own copy
Thread1 writes line to memory
Thread2 re-issues RWITM which results in read from memory (or next layer cache)

Thread1 wants to write to HEAD :( same song and dance again, the above gets in it's way etc.

The diagram to the right is mine, note the numbers on the left track the explanation and the idea was to show the cache lines as they mutate on a timeline. Hope it helps :-).

Reducing Read Misses

So False Sharing is really bad, but this is not about False Sharing (for a change). The principal is similar though. With False Sharing we get Write/Read misses, the above manoeuvre is about eliminating Read misses. This is what happens without the HEADC/TAILC fields (C for cache):

Thread1 and Thread2 both have the HEAD/TAIL cache lines in Shared state in their cache
Thread1 modifies HEAD (state goes from S to E), Thread2's copy is now Invalid
Thread2 wants to read HEAD to check if the queue wrapped, experiences a Read Miss:

Thread2 issues a request for the HEAD cache line
Read is picked up by Thread1
Thread1 delivers the HEAD cache line to Thread2, the request to memory is dropped
The cache line is written out to main memory and now Thread1/2 have the line in Shared state

This is not as bad as before, but is still far from ideal. The thing is we don't need the latest value of HEAD/TAIL to make progress, it's enough to know that the next increment to our counter is available to be used. So in this use case, caching the head/tail gives us a great win by avoiding all of these Read Misses, we only hit the Read Miss when we run out of progress to be made against our cached values. This looks very different:

Thread1 and Thread2 both have the HEAD/TAIL/HEADC/TAILC cache lines in Shared state in their cache (the TAILC is only in Thread1, HEADC is only in Thread2, but for symmetry they are included).
Thread1 reads TAILC and modifies HEAD (state goes from S to E), Thread2's copy is now Invalid.
Thread2 reads from HEADC which is not changed. If TAIL is less than HEADC we can make progress (offer elements) with no further delay. This continues until TAIL is equal to HEADC. As long as this is the case HEAD is not read, leaving it in state M in Thread1's cache. Similarly TAIL is kept in state M and Thread2 can make progress. This is when we get the great performance boost as the threads stay out of each others way.
Thread1 now needs to read TAIL to check if the queue wrapped, experiences a Read Miss:

Thread1 issues a request for the TAIL cache line
Request is picked up by Thread2
Thread2 delivers the TAIL cache line to Thread1, the request to memory is dropped
The cache line is written out to main memory and now Thread1/2 have the line in Shared state
TAIL is written to TAILC (TAILC line is now E)

The important thing is that we spend most of our time not needing the latest HEAD value which means Thread2 is not experiencing as many Read Misses, with the added benefit of the threads not having to go back and forth from Shared to Exclusive/Modified state on writes. Once I have a line in the M state I can just keep modifying with no need for any coherence traffic. Also note we never experience a Write Miss.

Summary

In an ideal application (from a cache coherency performance POV) threads never/rarely hit a read or write miss. This translates to having single writers to any piece of data, and minimizing reads to any external data which may be rapidly changing. When we are able to achieve this state we are truly bound by our CPU (rather than memory access).

The takeaway here is: Fast moving data should be in M state as much as possible. A read/write miss by any other thread competing for that line will lead to reverting to S/I which can have significant performance implications. The above example demonstrates how this can achieved by caching stale but usable copies locally to another thread.

Monday, 15 July 2013

Single Producer Single Consumer Queue Revisited: An empiricist tale - part I

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
Applying lessons learnt about memory layout to prev. discussed SPSC (Single Producer Single Consumer) queue. Tackling and observing some manifestations of false sharing. Getting bitch slapped by reality.

In preparations to implement an MPMC (Many Producers Many Consumers) queue I went back to Martin Thompson's SPSC queue I dissected in detail in this blog post. I was going to use it as the basis for the new queue with a few changes and discuss the transition. In particular an opportunity offered in the implementation of said queue with the addition of the new getAndAdd intrinsic to JDK 8. I was going to... but then I thought, 'let me try a couple more things!' :-).

Where were we?

In case you can't be bothered to go back and read the whole thing again, here's a quick summary. Mr. Thompson open sourced a few samples he discusses in his presentation on lock-free algorithms, in particular a SPSC queue developed across a few stages. I broke down the stages further and benchmarked before and after to explore the effect of each optimisation as it is applied. I then ported the same queue into an off-heap implementation and used it to demonstrate a fast IPC mechanism capable of sending 135M messages per second between processes. The queue demonstrated the following techniques:

Snowman contemplating
evolution

lock free, single writer principle observed.
Set capacity to power of 2, use mask instead of modulo.
Use lazySet instead of volatile write.
Minimize volatile reads by adding local cache fields.
Pad all the hot fields: head, tail, headCache,tailCache to avoid false sharing.

So... what's to improve on? There were a few niggles I had looking at this code again, some I've mentioned before. The last time I benchmarked the original queue implementation I noticed high run to run variance in the results. This was particularly prominent when running across 2 cores on the same socket or across sockets.

To expose the variance I modified the test to produce a summary line (each test runs 20 iterations, the summary is the average of the last 10 test iterations) and ran it 30 times. The results demonstrate the variance (results are sorted to highlight the range, X-axis is the run index, Y-axis is a summary of the run, SC means producer and consumer run on the same core, CC means they run across cores on the same socket):

Original queue performance

OUCH! We get half the performance 20% of the time. The results were very stable within a given run, leading me to believe there was a genuine issue at play.

So I thought, let's poke the i's and kick the t's, see if we can shake the variance out of the bugger.

Terms & Conditions

Benchmarks were carried out on a dual socket Xeon running CentOS 6 and Oracle JDK 1.7u17. Affinity was set using taskset, the scripts used are included with the code as well as the raw results and the LibreOffice spread sheets used for the graphs. Furry, fluffy animals were not harmed and people of different religions and races were treated with the outmost respect and afforded equal opportunity. I ran the cross socket tests, and the data is included, but I chose not to discuss them as no improvement was made on that front and they would make this already long post longer.

Flatten me counters!

To start off, I was not thrilled with the way the counter padding was implemented, for 2 reasons:

By using container classes for the counters we introduce indirection, we could optimise by inlining the data into the queue structure.
The Padded* classes only pad to one side, we are counting on the instances to be laid out together because they are instantiated together. In the right environment I'm pretty sure this can go wrong. By go wrong I mean the instances might get allocated/placed next to data modified elsewhere leading to false sharing. By inlining the counters and forcing strict order we could kill 2 bird with one stone.

To inline the counters, and provide the padding required to provide false-sharing protection I used inheritance to force layout (as outlined previously here). I used Unsafe to get field offset and implement lazySet directly into the fields inlined in my class (this is replacing the original PaddedLong/PaddedAtomicLong, the same method is used in AtomicLong to implement lazySet) the code:

It ain't pretty, but it does the job. The layout can be verified using the excellent java-object-layout tool (I format the output for brevity):

psy.lob.saw.queues.spsc1.SPSCQueue1
offset size type description
0 12 (assumed to be the object header + first field alignment)
   12 4 int ColdFields.capacity
   16 4 int ColdFields.mask
   20 4 Object[] ColdFields.buffer
   24-72 8 long L1Pad.p10-16
   80 8 long TailField.tail
   88-136 8 long L2Pad.p20-p26
144 8 long HeadCache.headCache
152-200 8 long L3Pad.p30-36
208 8 long HeadField.head
216-264 8 long L4Pad.p40-46
272 8 long TailCache.tailCache
280-328 8 long L5Pad.p50-56
  336 (object boundary, size estimate)

Wicked! Run the same tests above to see what happened:

Original vs. Inlined counters(I1)

We get a small improvement when running on the same core (average 3% improvement), but the cross core behaviour is actually worse. Bummer, keep kicking.

Padding the class fields

If we look at the above memory layout, we'll notice the fields capacity, mask and buffer are left flush against the object header. This means that they are open to false sharing with other objects/data allocated on the same cache line. We can add a further love handle on that big boy to cover that front:

Note that by happy coincidence we have already padded the tail side of the fields as part of our flattening exercise. So now the layout is:

psy.lob.saw.queues.spsc3.SPSCQueue3
offset size type description
0 12 (assumed to be the object header + first field alignment)
   12 4 (alignment/padding gap)
   16-72 8 long L0Pad.p00-07
   80 4 int ColdFields.capacity
   84 4 int ColdFields.mask
   88 4 Object[] ColdFields.buffer
   92 4 (alignment/padding gap)
   96-144 8 long L1Pad.p10-16
152 8 long TailField.tail
160-208 8 long L2Pad.p20-26
216 8 long HeadCache.headCache
224-272 8 long L3Pad.p30-36
280 8 long HeadField.head
288-336 8 long L4Pad.p40-46
344 8 long TailCache.tailCache
352-400 8 long L5Pad.p50-56
  408 (object boundary, size estimate)

Try again and we get:

Original vs Inlined counters and padded class(I3)

Same single core behaviour as above, but the cross core behaviour looks stable. Sadly the cross core results are worse than the original in many cases. Good thing there is one more trick up our sleeves.

Padding the data

So, this last step may seem a bit over the top, padding the sides of the buffer as well as spacing out all the bloody fields. Surely you must be joking? I padded the buffer by allocating an extra 32 slots to the array and skipping the first 16 on access. The object layout remains the same, you'll have to imagine the extra padding (code is here). But the results are:

Original vs Inlined counters, padded class and padded data(I4)

Run fat Q! Run! This is a nice kick in the single core category (10% increase) and in the cross core it is pretty flat indeed. So when the original behaves they are much the same, but the average result is a 10% improvement. Very little variance remains.

But it's UGLY!

Got me there, indeed it is. Why do we need to go this far to get rid of flaky behaviour? This is one of

Y do we care?

them times when having a direct line to memory management really helps, and Java has always been about you not having to worry about this sort of thing. There is help on the way in the form of the @Contended annotation which would render the above class much nicer, but even then you will need to pad the buffer by yourself. If you look at how the OffHeapQueue manages this issue, you'll see that less padding is required when you know the data alignment. Sadly there is no @Align(size=byte|short|int|long|cache|page) annotation coming anytime soon, and the @Contended annotation is not clear on how you could go about marking an instance rather than a class/field as contended.

Hang on, what if we do it the other way around?

For all you logical minded people out there who think: "we applied the changes together, but not separately. What if this is all about the data padding? we could fix the original without all this fuss...". I feel your pain obsessive brothers. So I went back and added 2 variants, one of a padded data original(referred to as O2 in the graphs) and another of the padded data and inlined fields without the class field padding. I was surprised by the results:

Same Core comparison of data padding impact

Padding the data, of and by itself made things worse for the original implementation and the inlined implementation when running on the same core. Padding the data sucks!

Cross Core comparison of data padding impact

When we look at the cross core results we can see some benefit from the data padding, suggesting it is part of the issue we are trying to solve, but not the whole story.
Padding the cold fields by itself also did little for the performance, as demonstrated above, but removed some of the variance. The 2 put together however gave us some extra performance, and killed the variance issue. Why? I don't know... But there you go, holistic effects of memory layout at work.

A happy ending?

Well... sort of. You see, all the above tests were run with the following JVM options:

-XX:+UseNUMA -XX:+UseCondCardMark -XX:CompileThreshold=100000

And thus the results present a certain truth, but maybe not all of the truth for everyone all of the time... I decided to waste some electricity and rerun some variations of the above options to see what happens. Running with no JVM options I got the following:

Cross core - no JVM opts

I4 which is the final inlined version is still quite stable, but it's performance lags behind the other implementations.
O1 which is the original implementation has less variance then before (could be luck, who knows) and has the best average result.

Same core - no JVM opts

This time I3 (inlined counters, padded class) is the clear winner.
I2 (inlined counters, padded data) is second, followed by I4 and I1.
When running on the same core, inlining the counters delivers a minor performance win.

Running with -XX:+UseCondCardMark:

ConcCardMark cross core

Using CondCardMark has reduced the range somewhat, but still a fair range for all.
I4 and I3 are stable, but the overall best average score goes to I1 (133M).
I4 is overall slightly better then O1 but worse then O2(original implementation with padded data)

ConcCardMark same core

I1 is the best by quite a bit, followed by I4.

Running with -XX:+UseCondCardMark -XX:CompileThreshold=100000:

-XX:+UseCondCardMark -XX:CompileThreshold=100000 - Cross Core

With the increased compile threshold O2 has pulled to the top followed by I4 and O1.
I4 is looking very much like O1
I1, which was the clear winner a second ago, is now rubbish.

-XX:+UseCondCardMark -XX:CompileThreshold=100000 - Same Core

On the same core we are seeing now the same behaviour we saw in the beginning.
I4 is the clear winner, I1 and I3 pushed down to second place etc.
This is odd, why would giving the JIT more data for compilation push the performance of I1 down?

And for comparison the results from -XX:+UseNUMA -XX:+UseCondCardMark -XX:CompileThreshold=100000 presented together:

All together now - Cross Core

I4 is the most stable, although O2 has the best overall average throughput.
Comparing with all the other combinations we seem to have degraded the performance of other options on our way to find the best option for I4 :(

All together now - Same Core

What does it all mean?

At this point I got quite frustrated. The above approach was producing improvements to the variance, and even an improvement to overall performance on occasion, but the effect was not as decisive and clear as I'd have liked. Some of the variation I just could not shake, even with the best result above I4 is still moving between 110M and 120M and has a spike on either side of this range.

These results are a fine demonstration of how time consuming and error prone the estimation of performance can be. To collect the above results I had to setup the same test to run for each implementation 30 times for each of the affinity options (same core/cross core/cross socket) and repeat
for 4 JVM options combinations (8 impls * 30 runs * 3 affinity setups * 4 JVM_OPTS + extra trial and error runs... = lots of time, quite a bit of data). This result is likely to be different on different hardware, it is likely to change with JDK8, other JVM options and so on. Even with all this effort of collecting data I am very far from having a complete answer. Is this data enough to produce meaningful results/insights at all?

To a certain extent there is evident progress here towards eliminating some of the sources of the run to run variation. Looking at the above results I feel justified in my interpretation of the object layout and the test results. In running the same code on other hardware I've observed good results for I4 and similar variation for the O1, so not all is lost. But this journey is, surprisingly enough, still not over...

More, more, more!

David Hume

If you found this riveting tale of minute implementation changes and their wacky effect on performance absorbing you will love the next chapter in which:

I explore cross socket performance and it's implications
We contemplate the ownership of memory and it's impact
The original implementation is evolved further
We hit a high note with 250M ops per second

This post is long enough as is :-)