Psychosomatic, Lobotomy, Saw: Affinity


Sunday, 7 April 2013

135 Million messages a second between processes in pure Java

{This post is part of a long running series on lock free queues, check out the full index to get more context here}
Porting an existing single producer/single consumer concurrent queue into an IPC mechanism via memory mapped files and getting 135 million messages throughput in pure Java.
In my previous post I covered a single producer/consumer queue developed and shared by Martin Thompson capable of delivering an amazing 130M messages per second. The queue he delivered is a great tool for communicating between threads, but sometimes communicating between threads is not enough. Sometimes you need to leave your JVM and go out of process. Inter Process Communication (IPC) is a different problem to inter thread communication. Can it be cracked by the same approach?

IPC, what's the problem?

Inter Process Communication is an old problem and there are many ways to solve it (which I will not discuss here). There are several attractions to specialized IPC solutions for Java:

Faster than socket communication.
An out of process integration option with applications written in other languages.
A means of splitting large VMs to smaller ones improving performance by allowing GC and JIT specialization.

For IPC to be attractive it has to be fast, otherwise you may as well go for network based solutions which would extend beyond your local machine uniformly. I attended an Informatica conference a while back and got talking to Todd Montgomery about the Disruptor and mechanical sympathy. He suggested that IPC should be able to perform as well as inter thread messaging. I found the idea interesting and originally meant to port the Disruptor, but Martin's queue is simpler (and quicker) so I went for that instead. Starting with a good algorithm/data structure is very good indeed; now I just needed to bridge the gap and see if I could maintain the benefits.

Off the heap we go!

To do IPC we must go off heap. This has several implications for the queue, most importantly references are not supported. Also note persistence to and from the queue is required, though one could extend my implementation to support a zero copy interaction where a struct is acquired, written and committed instead of the offer method, and similarly acquired, read and finally released instead of the poll method. I plan to make several flavours of this queue to test out these ideas in the near future.

My IPC queue uses a memory mapped file as a means of acquiring a chunk of shared memory, there is no intention to use the persisted values though further development in that direction may prove interesting to some. So now that I got me some shared memory, I had to put the queue in it.

I started by laying out the queue counters and cached counters. After realizing the counters need to be aligned to work properly I learnt how to align memory in Java. I went on to verify that aligned memory offers the guarantees required for concurrent access. Quick summary:

aligned access means writing data types to addresses which divide by their size.
unaligned access is not atomic, which is bad for concurrency :(
unaligned access is slow, which is bad for performance :(
unaligned access may not work, depending on OS and architecture. Not working is very bad :(
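The summary above comes down to a little address arithmetic. Here is a minimal sketch of it, assuming a 64 byte cache line; the class and method names (AlignmentSketch, isAligned, alignUp) are hypothetical and not taken from the queue's code:

```java
// Illustrative sketch only: the names here are hypothetical, not from the queue's repository.
final class AlignmentSketch {
    // Assumption: 64 byte cache lines, which is typical for current x86 CPUs.
    static final long CACHE_LINE = 64;

    // An access is aligned when the address is a multiple of the data type size.
    // The size is assumed to be a power of 2 (1, 2, 4, 8...).
    static boolean isAligned(long address, long size) {
        return (address & (size - 1)) == 0;
    }

    // Round an address up to the next multiple of a power of 2 alignment.
    static long alignUp(long address, long alignment) {
        return (address + alignment - 1) & ~(alignment - 1);
    }
}
```

For example, isAligned(24, 8) holds while isAligned(60, 8) does not, and alignUp(100, CACHE_LINE) gives 128, the start of the next cache line.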

Sorting out alignment is not such a big deal once you know how it works. One of the nice things about going off-heap was that solving false sharing has become far more straightforward. Move your pointer and you're in the next cache line, job done. This left me rather frustrated with the tricks required to control memory layout in Java. Going back to the original implementation you will notice the Padded classes whose role it is to offer false sharing protection. They are glorious hacks (with all due respect) made necessary by this lack of control. The @Contended annotation coming in JDK 8 will hopefully remove the need for this.
This is how the memory layout worked out:

To illustrate in glorious ASCII graphics (each - is a byte), this is what the memory layout looks like when broken into cache lines:
|--------|--------|--------|head....|--------|--------|--------|--------|
|--------|--------|--------|tailCach|--------|--------|--------|--------|
|--------|--------|--------|tail----|--------|--------|--------|--------|
|--------|--------|--------|headCach|--------|--------|--------|--------|

I played around with mixing off heap counters with an on heap buffer, but in the interest of brevity I'll summarize and say the JVM does not like that very much and the resulting performance is not as good as either the all on-heap or the all off-heap solution. The code is available with everything else.

Once alignment and memory layout were sorted I had to give up the flexibility of having reference pointers and settle for writing my data (an integer) directly into the memory. This leaves my queue very restrictive in its current form. I intend to revisit it and see what I can do to offer a more extendable API on top of it.

Let me summarize the recipe at this point:

Create a memory mapped file large enough to hold:

4 cache lines for counters/cached counters.
4 bytes(per integer) * queue capacity (must be a power of 2).
1 spare cache line to ensure you can align the above to the cache line.

Get a mapped byte buffer, which is a direct byte buffer on top of the mapped memory.
Steal the address and get the contained aligned byte buffer.
Setup pointers to the counters and the beginning of the buffer
Replace use of natural counters with off heap counters accessed via Unsafe using the pointers.
Replace use of array with use of offset pointers into buffer and Unsafe access.
Test and debug until you work out the kinks...
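The sizing and offset arithmetic in the recipe can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from the repository: it assumes 64 byte cache lines, places each counter at the start of its own line (the layout above puts it mid-line), and stops short of the Unsafe steps. All names (OffHeapLayoutSketch and its methods) are hypothetical.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative sketch only: hypothetical names, simplified layout.
final class OffHeapLayoutSketch {
    static final int CACHE_LINE = 64;   // assumption: 64 byte cache lines
    static final int COUNTER_LINES = 4; // head, tailCache, tail, headCache
    static final int INT_SIZE = 4;      // the queue stores plain ints

    // Counters + data + one spare cache line to allow for alignment.
    static long requiredBytes(int capacity) {
        if (Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of 2");
        }
        return (long) COUNTER_LINES * CACHE_LINE + (long) INT_SIZE * capacity + CACHE_LINE;
    }

    // Offset of counter n (0..3) from the aligned base: one counter per cache line.
    static long counterOffset(int n) {
        return (long) n * CACHE_LINE;
    }

    // Offset of the element for a given sequence: data starts after the counters
    // and the sequence is wrapped with a mask rather than a modulo.
    static long elementOffset(long sequence, int capacity) {
        return (long) COUNTER_LINES * CACHE_LINE + (sequence & (capacity - 1)) * INT_SIZE;
    }

    // Map a file large enough for the queue. The mapping stays valid after the channel is closed.
    static MappedByteBuffer map(Path file, int capacity) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            return channel.map(FileChannel.MapMode.READ_WRITE, 0, requiredBytes(capacity));
        }
    }
}
```

For a capacity of 1024 this asks for 4 * 64 + 4 * 1024 + 64 = 4416 bytes. Two processes mapping the same file see the same memory, which is what turns the in process queue into an IPC one.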

The above code should give you a fair idea how it works out and the rest is here. This queue can work in process and out of process as demonstrated in the tests included in the repository. Now that it works (for the limited use case, and with room for further improvement... but works), is it fast enough? not so fast? is it...<gasp> ... FASTER????!?!?!

Smithers, release the hounds

Here are the numbers for using the different implementations in process:

Implementation/Affinity	Same core	Cross core	Cross socket
P1C1QueueOriginal3	110M	130M	19M
P1C1OffHeapQueue	130M	220M	200M
P1C1QueueOriginalPrimitive	124M	220M	215M

Confused? Let me explain. First line is the measurements taken for the original queue. Similar to what was presented in prev. post, though I saw a slight improvement in the results with increasing the compile threshold to 100000.
The second line is my off-heap implementation of the same algorithm. It is significantly faster. This is not IPC yet, this is in process. The reason it is faster is because data is inlined in the queue, which means that by loading an entry in the queue we get the data as opposed to a reference to the data. Getting a reference is what you get when you have an Object[] array. The array holds the references and the data is elsewhere, which seems to make it more painful as we get further from the producer.
The last entry is a mutation of P1C1QueueOriginal3 into a primitive array backed queue to compare performance like for like. As you can see this displays very similar results to the off heap implementation supporting the theory that data in-lining is behind the observed performance boost.
The lesson here is an old one, namely that pointer chasing is expensive business further amplified by the distance between the producing CPU and consuming CPU.
The off-heap queue can offer an alternative to native code integration as the consuming thread may interact directly with the off-heap queue and write results back to a different off-heap queue.
Running a similar benchmark adapted to use a memory mapped file as the backing DirectByteBuffer for the off-heap queue we get:
    same core - ops/sec=135M
    across cores - ops/sec=98M
    across sockets - ops/sec=25M

JOY! A pure Java IPC that gives you 135M messages per second is more throughput than you'd get with most commercial products out there. This is still not as fast as the same queue in process and I admit I'm not sure what the source of the performance difference is. Still I am quite happy with it.
A few notes/observations from the experimentation process:

I got a variety of results, stabilizing around different average throughputs. I chose the best for the above summary and plan to go into detail about the results in the near future.
The JVM was launched with: -XX:+UseCondCardMark -XX:CompileThreshold=100000
Removing the Thread.yield from the producer/consumer loops improved performance when running on the same core, but made it worse otherwise.
Moving the queue allocation into the test loop changes the performance profile dramatically.
I've not had time to fully explore the size of the queue as a variable in the experiment but the little I've done suggests it makes a difference, choose the right size for your application.

I realize this post is rather less accessible than the previous one, so if you have any questions please ask.

Sunday, 17 March 2013

Single Producer/Consumer lock free Queue step by step

{This post is part of a long running series on lock free queues, check out the full index to get more context here}
Reading through a fine tuned lock free single producer/consumer queue. Working through the improvements made and their respective impact.
Back in November, Martin Thompson posted a very nice bit of code on GitHub to accompany his Lock Free presentation. It's a single producer/consumer queue that is very fast indeed. His repository includes the gradual improvements made, which is unusual in software, and offers us some insights into the optimization process. In this post I break down the gradual steps even further, discuss the reasoning behind each step and compare the impact on performance. My fork of his code is to be found here and contains the material for this post and the up and coming post. For this discussion focus on the P1C1QueueOriginal classes.
The benchmarks for this post were run on a dual socket Intel(R) Xeon(R) CPU E5-2630 @ 2.30GHz machine running Linux and OpenJdk 1.7.09. A number of runs were made to ensure measurement stability and then a representative result from those runs was taken. Affinity was set using taskset. Running on same core is pinned to 2 logical cores on the same physical core. Across cores means pinned to 2 different physical cores on the same socket. Across sockets means pinned to 2 different cores on different sockets.

Starting point: Lock free, single writer principle

The initial version of the queue, P1C1QueueOriginal1, while very straightforward in its implementation, already offers us a significant performance improvement and demonstrates the important Single Writer Principle. It is worthwhile reading and comparing offer/poll with their counterparts in ArrayBlockingQueue.
Running the benchmark for ArrayBlockingQueue on same core/across cores/across sockets yields the following result (run the QueuePerfTest with parameter 0):
same core - ops/sec=9,844,983
across cores - ops/sec=5,946,312 [I got lots of variance on this one, took an average]
across sockets - ops/sec=2,031,953

We expect this degrading scale as we pay more and more for cache traffic. These results are our out of the box JDK baseline.
When we move on to our first cut of the P1C1 Queue we get the following result (run the QueuePerfTest with parameter 1):
same core - ops/sec=24,180,830[2.5xABQ]
across cores - ops/sec=10,572,447[~2xABQ]
across sockets - ops/sec=3,285,411[~1.5xABQ]

Jumping jelly fish! Off to a good start with large improvements on all fronts. At this point we have gained speed at the expense of limiting our scope from multi producer/consumer to single producer/consumer. To go further we will need to show greater love for the machine. Note that the P1C1QueueOriginal1 class is the same as Martin's OneToOneConcurrentArrayQueue.

Lazy loving: lazySet replaces volatile write

As discussed previously on this blog, single writers can get a performance boost by replacing volatile writes with lazySet. We replace the volatile long fields for head and tail with AtomicLong and use get for reads and lazySet for writes in P1C1QueueOriginal12. We now get the following result (run the QueuePerfTest with parameter 12):
same core - ops/sec=48,879,956[2xP1C1.1]
across cores - ops/sec=30,381,175[3xP1C1.1 large variance, average result]
across sockets - ops/sec=10,899,806[3xP1C1.1]

As you may or may not recall, lazySet is a cheap volatile write such that it provides happens-before guarantees for single writers without forcing a drain of the store buffer. This manifests in this case as lower overhead to the thread writing, as well as reduced cache coherency noise as writes are not forced through.
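A minimal sketch of the single writer pattern described above, using only AtomicLong.get and AtomicLong.lazySet. The class name and the run harness are hypothetical and only there to show the value becoming visible to a reader; this is not the queue code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: hypothetical names, not taken from the queue's repository.
final class LazySetSketch {
    // Written by exactly one thread (the single writer), read by any thread.
    private final AtomicLong counter = new AtomicLong();

    // Ordered store: not reordered with earlier writes, but no store buffer drain.
    void publish(long value) {
        counter.lazySet(value);
    }

    // Volatile read, as used by the reading side.
    long read() {
        return counter.get();
    }

    // One writer thread counts up to limit, the calling thread spins until it sees limit.
    static long run(long limit) {
        LazySetSketch sketch = new LazySetSketch();
        Thread writer = new Thread(() -> {
            for (long i = 1; i <= limit; i++) {
                sketch.publish(i);
            }
        });
        writer.start();
        long seen;
        while ((seen = sketch.read()) < limit) {
            Thread.yield();
        }
        return seen;
    }
}
```

Because only one thread ever writes the counter, the writer never needs to re-read its own value with volatile semantics, which is what makes the cheaper store safe here.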

Mask of power: use '& ((2 pow k) - 1)' instead of '%'

The next improvement is replacing the modulo operation for wrapping the array index location with a bitwise mask. This 'trick' is also present in ring buffer implementations, Cliff Click CHM and ArrayDeque. Combined with the lazySet improvement this version is Martin's OneToOneConcurrentArrayQueue2 or P1C1QueueOriginal2 in my fork.
The result (run the QueuePerfTest with parameter 2):
same core - ops/sec=86,409,484[1.9xP1C1.12]
across cores - ops/sec=47,262,351[1.6xP1C1.12 large variance, average result]
across sockets - ops/sec=11,731,929[1.1xP1C1.12]

We made a minor trade off here, forcing the queue to have a size which is a power of 2, and we got some excellent mileage out of it. The modulo operator is quite expensive both in terms of cost and in terms of limiting instruction throughput, and this is a trick worth employing when the opportunity arises.
So far our love for the underlying architecture is expressed by offering cheap alternatives for expensive instructions. The next step is sympathy for the cache line.
[UPDATE 3/11/2014: See further focused investigation into the merits of different modulo implementations here]
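A small sketch of the mask trick, assuming non-negative sequence numbers and a capacity between 1 and 2^30. The names (MaskSketch, roundUpToPowerOfTwo and so on) are hypothetical:

```java
// Illustrative sketch only: hypothetical names.
final class MaskSketch {
    // Smallest power of 2 that is >= requested (valid for 1 <= requested <= 2^30).
    static int roundUpToPowerOfTwo(int requested) {
        return 1 << (32 - Integer.numberOfLeadingZeros(requested - 1));
    }

    // Wrapping the sequence into the array with the modulo operator.
    static int indexByModulo(long sequence, int capacity) {
        return (int) (sequence % capacity);
    }

    // Same result when capacity is a power of 2: keep only the low bits.
    static int indexByMask(long sequence, int capacity) {
        return (int) sequence & (capacity - 1);
    }
}
```

With a capacity of 1024 both indexByModulo(1027, 1024) and indexByMask(1027, 1024) give 3. The price is the trade off mentioned above: a requested size of 1000 is rounded up to 1024.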

False sharing

False sharing is described elsewhere(here, and later here, and more recently here where the coming of a solution by @Contended annotation is discussed). To summarize, from the CPU cache coherency system perspective if a thread writes to a cache line then it 'owns' it, if another thread then needs to write to the same line they need to exchange ownership. When this happens for writes into different locations the sharing is 'falsely' assumed by the CPU and time is wasted on the exchange. The next step of improvement is made by padding the head and tail fields such that they are not on the same cache line in P1C1QueueOriginal21.
The result (run the QueuePerfTest with parameter 21):
same core - ops/sec=88,709,910[1.02xP1C1.2]
across cores - ops/sec=52,425,315[1.1xP1C1.2]
across sockets - ops/sec=13,025,529[1.2xP1C1.2]

This made less of an impact than previous changes. False sharing is a less deterministic side effect and may manifest differently based on luck of the draw memory placement. We expect code which avoids false sharing to have less variance in performance. To see the variation run the experiment repeatedly, which will result in a different memory layout of the allocated queue each time.
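For illustration, a padded counter in the spirit of the Padded classes can be sketched like this. It assumes 64 byte cache lines and that the JVM lays the padding fields out after the inherited value, which is not guaranteed; the class name is hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: field layout is up to the JVM, so this is a best effort hack.
final class PaddedAtomicLongSketch extends AtomicLong {
    // Six unused longs (48 bytes). Together with the object header and the value
    // itself this is intended to fill a 64 byte cache line.
    public volatile long p1, p2, p3, p4, p5, p6 = 7L;

    PaddedAtomicLongSketch(long initialValue) {
        super(initialValue);
    }

    // Reads the padding so it is less likely to be treated as dead fields.
    public long sumPadding() {
        return p1 + p2 + p3 + p4 + p5 + p6;
    }
}
```

Using one such instance for head and another for tail is meant to keep the two hot counters from landing on the same cache line.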

Reducing volatile reads

A common myth regarding volatile reads is that they are free; the next improvement step shows that to be false. In P1C1QueueOriginal22 I have reversed the padding of the AtomicLong (i.e. head and tail are back to being plain AtomicLong) and added caching fields for the last read value of head and tail. As these values are only used from a single thread (tailCache is used by the consumer, headCache by the producer) they need not be volatile. Their only use is to reduce volatile reads to a minimum. Normal reads, unlike volatile reads, are open to greater optimization and may end up in a register; volatile reads are never from a register (i.e. always from memory).
The result (run the QueuePerfTest with parameter 22):
same core - ops/sec=99,181,930[1.13xP1C1.2]
across cores - ops/sec=80,288,491[1.6xP1C1.2]
across sockets - ops/sec=17,113,789[1.5xP1C1.2]

By Toutatis!!! This one is a cracker of an improvement. Not having to load the head/tail value from memory as volatile reads makes a massive difference.
The last improvement made is adding the same false sharing guard we had for the head and tail fields around the cache fields. This is required as these are both written at some point and can still cause false sharing, something we all tend to forget can happen to normal fields/data even if it is nothing to do with concurrency. I've added a further implementation P1C1QueueOriginal23 where only the cache fields are padded and not the head and tail. It makes for a slight further improvement, but as the head and tail are still suffering from false sharing it is not a massive step forward.

UPDATE(21/09/2013): As Martin argues in the comments below the above justification on the source of improvement to performance gained by adding the cache fields is incorrect. The performance improvement is caused mostly by the reduction in cache coherency traffic. For an expanded explanation see this later post here. Volatile reads are by no means free, and some of the improvement is due to the reduction in reads, but it is not the main reason for the improvement.

All together now!

The final version P1C1QueueOriginal3 packs together all the improvements made before:

lock free, single writer principle observed. [Trade off: single producer/consumer]
Set capacity to power of 2, use mask instead of modulo. [Trade off: more space than intended]
Use lazySet instead of volatile write.
Minimize volatile reads by adding local cache fields. [Trade off: minor size increment]
Pad all the hot fields: head, tail, headCache,tailCache [Trade off: minor size increment]
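The list above, minus the padding, can be sketched in a few lines. This is a simplified illustration written for this summary rather than the P1C1QueueOriginal3 source: the class name is hypothetical and the counters and cache fields are left unpadded for brevity.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative single producer/single consumer queue sketch (padding omitted).
final class SpscQueueSketch<E> {
    private final int capacity;
    private final int mask;
    private final E[] buffer;
    private final AtomicLong tail = new AtomicLong(0); // written only by the producer
    private final AtomicLong head = new AtomicLong(0); // written only by the consumer
    private long headCache = 0; // producer's last seen value of head
    private long tailCache = 0; // consumer's last seen value of tail

    @SuppressWarnings("unchecked")
    SpscQueueSketch(int requestedCapacity) {
        // Round up to a power of 2 so a mask can replace the modulo.
        capacity = 1 << (32 - Integer.numberOfLeadingZeros(requestedCapacity - 1));
        mask = capacity - 1;
        buffer = (E[]) new Object[capacity];
    }

    // Producer thread only.
    boolean offer(E e) {
        if (e == null) {
            throw new NullPointerException("null elements are not allowed");
        }
        final long currentTail = tail.get();
        final long wrapPoint = currentTail - capacity;
        if (headCache <= wrapPoint) {
            headCache = head.get(); // refresh the cached head only when we look full
            if (headCache <= wrapPoint) {
                return false; // queue is full
            }
        }
        buffer[(int) currentTail & mask] = e;
        tail.lazySet(currentTail + 1); // cheap ordered publish
        return true;
    }

    // Consumer thread only.
    E poll() {
        final long currentHead = head.get();
        if (currentHead >= tailCache) {
            tailCache = tail.get(); // refresh the cached tail only when we look empty
            if (currentHead >= tailCache) {
                return null; // queue is empty
            }
        }
        final int index = (int) currentHead & mask;
        final E e = buffer[index];
        buffer[index] = null;
        head.lazySet(currentHead + 1);
        return e;
    }
}
```

Each counter has exactly one writer, so lazySet is enough for publication, and the cache fields mean the other side's counter is only read when the queue looks full or empty.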

The result (run the QueuePerfTest with parameter 3):
same core - ops/sec=110,405,940[1.33xP1C1.2]
across cores - ops/sec=130,982,020[2.6xP1C1.2]
across sockets - ops/sec=19,338,354[1.7xP1C1.2]

To put these results in the context of the ArrayBlockingQueue:
same core - ops/sec=110,405,940[11xABQ]
across cores - ops/sec=130,982,020[26xABQ]
across sockets - ops/sec=19,338,354[9xABQ]

This is a great improvement indeed, hats off to Mr. Thompson.

Summary, and the road ahead

My intent in this post was to give context and add meaning to the different performance optimizations used in the queue implementation. At that I am just elaborating Martin's work further. If you find the above interesting I recommend you attend one of his courses or talks (if you can find a place).

During the course of running these experiments I encountered great variance between runs, particularly in the case of running across cores. I chose not to explore that aspect in this post and picked representative measurements for the sake of demonstration. To put it another way: your mileage may vary.

Finally, my interest in the queue implementation was largely as a data structure I could port off heap to be used as an IPC messaging mechanism. I have done that and the results are in my fork of Martin's code here. This post evolved out of the introduction to the post about my IPC queue; it grew to the point where I decided they would be better apart, so here we are. The IPC queue is working and achieves similar throughput between processes as demonstrated above... coming soon.
UPDATE: IPC post published here hitting 135M between processes
UPDATE: Run to run variance further explored and eliminated here

Thursday, 20 December 2012

Java ping: a performance baseline utility

Summary: an open source mini utility for establishing a baseline measurement of Java application TCP latency, a short discussion on the value of baseline performance measurements, and a handful of measurements taken using the utility.

We all know and love ping, and in most environments it's available for us as a means of testing the basic network latency between machines. This is extremely useful, but ping measures ICMP echo rather than TCP, it is not written in Java and it's also not written with low latency in mind. This is important (or at least I think it is) when examining a Java application and trying to make an informed judgement on observed messaging latency in a given environment/setup.
I'll start with the code and bore you with the philosophy later:

The mechanics should be familiar to anyone who used NIO before, the notable difference from common practice is using NIO non-blocking channels to perform essentially blocking network operations.
The code was heavily 'inspired' by Peter Lawrey's socket performance analysis post and code samples (according to Mr. Lawrey's licence you may have to buy him a pint if you find it useful, I certainly owe him one). I tweaked the implementation to make the client spin as well as the server which improved the latency a bit further. I separated the client and server, added an Ant build to package them with some scripts and so on. Notes:

The server has to be running before the client connects and will shut down when the client disconnects.
Both server and client will eat up a CPU as they both spin in wait for data on the socket channel.
To get the best results pin the process to a core (as per the scripts).
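As a rough illustration of the mechanics described above (non-blocking channels used in a spinning, effectively blocking way), here is a self-contained loopback sketch. It is not the utility's code: the class name, the 32 byte message size and running both ends in one process are assumptions made to keep the example short.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Illustrative sketch only: hypothetical names, both ends in one JVM over loopback.
final class SpinPingSketch {
    static final int MESSAGE_SIZE = 32; // assumption: 32 byte messages

    // Spin until the whole buffer has been written.
    static void writeFully(SocketChannel channel, ByteBuffer buffer) throws IOException {
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
    }

    // Spin on the non-blocking read until a whole message has arrived.
    static void readFully(SocketChannel channel, ByteBuffer buffer) throws IOException {
        buffer.clear();
        while (buffer.hasRemaining()) {
            if (channel.read(buffer) < 0) {
                throw new IOException("peer closed the connection");
            }
        }
        buffer.flip();
    }

    // Server side: accept one client and echo back a fixed number of messages.
    static void echo(ServerSocketChannel server, int rounds) {
        try (SocketChannel peer = server.accept()) {
            peer.setOption(StandardSocketOptions.TCP_NODELAY, true);
            peer.configureBlocking(false);
            ByteBuffer buffer = ByteBuffer.allocateDirect(MESSAGE_SIZE);
            for (int i = 0; i < rounds; i++) {
                readFully(peer, buffer);
                writeFully(peer, buffer);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Client side: returns one round trip time in nanoseconds per round (rounds >= 1).
    static long[] ping(int rounds) {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            Thread echoThread = new Thread(() -> echo(server, rounds));
            echoThread.setDaemon(true);
            echoThread.start();
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress())) {
                client.setOption(StandardSocketOptions.TCP_NODELAY, true);
                client.configureBlocking(false);
                ByteBuffer message = ByteBuffer.allocateDirect(MESSAGE_SIZE);
                long[] rtt = new long[rounds];
                for (int i = 0; i < rounds; i++) {
                    message.clear();
                    long start = System.nanoTime();
                    writeFully(client, message);
                    readFully(client, message);
                    rtt[i] = System.nanoTime() - start;
                }
                return rtt;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both loops burn a CPU while waiting, which matches the second note above. For real measurements the two ends would run as separate pinned processes with a warm up phase, neither of which this sketch attempts.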

Baseline performance as a useful measure

When measuring performance we often compare the performance of one product to the next. This is especially true when comparing higher level abstraction products which are supposed to remove us from the pain of networking, IO or other such ordinary and 'technical' tasks. It is important however to remember that abstraction comes at a premium, and having a baseline measure for your use case can help determine the premium. To offer a lame metaphor this is not unlike considering the bill of materials in the bottom line presented to you by your builder.
While this is not a full blown application, it illustrates the cost/latency inherent in doing TCP networking in Java. Any other cost involved in your application request/response latency needs justifying. It is reasonable to make all sorts of compromises when developing software, and indeed there are many a corner to be cut in a 50 line sample that simply would not do in a full blown server application, but the 50 line sample tells us something about the inherent cost. Some of the overhead you may find acceptable for your use case, other times it may not seem acceptable, but having a baseline informs you on the premium.

On the same stack (hardware/JDK/OS) your application will be slower than your baseline measurement, unless it does nothing at all.
If you are using any type of framework, compare the bare bones baseline with your framework baseline to find the basic overhead of the framework (you can use the above to compare with Netty/MINA for instance).
Consider the hardware level functionality of your software to match with baseline performance figures (i.e: sending messages == socket IO, logging == disk IO etc.). If you think a logging framework has little overhead on top of the cost of serializing a byte buffer to disk, think again.

Variety is the spice of life

To demonstrate how one would use this little tool I took it for a ride:

All numbers are in nanoseconds
Tests were run pinned to CPUs, I checked the variation between running on same core, across cores and across sockets
This is RTT(round trip time), not one hop latency(which is RTT/2)
The code prints out a histogram summary of pinging a 32b message. Mean is the average, 50% means 50% of updates had a latency below X, 99%/99.99% in the same vein. (Percentiles are commonly used to measure latency SLAs.)
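To make the mean/percentile vocabulary concrete, here is a small sketch that summarises an array of round trip times. It uses the nearest-rank percentile definition on exact samples, which is an assumption: the utility itself reports from a histogram, so its figures are bucketed. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Locale;

// Illustrative sketch only: hypothetical names, nearest-rank percentiles on raw samples.
final class LatencySummarySketch {
    static double mean(long[] samples) {
        long sum = 0;
        for (long sample : samples) {
            sum += sample;
        }
        return (double) sum / samples.length;
    }

    // Nearest-rank percentile: the smallest sample such that at least pct percent
    // of all samples are less than or equal to it. Expects a sorted array.
    static long percentile(long[] sorted, double pct) {
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    static String summarise(long[] rttNanos) {
        long[] sorted = rttNanos.clone();
        Arrays.sort(sorted);
        return String.format(Locale.ROOT, "mean=%.2f, 50%%=%d, 99%%=%d, 99.99%%=%d",
                mean(sorted), percentile(sorted, 50), percentile(sorted, 99), percentile(sorted, 99.99));
    }
}
```

With the four samples 10, 20, 30 and 40 the mean is 25, the 50% value is 20 and both the 99% and 99.99% values are 40, which shows how a handful of slow samples dominate the high percentiles.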

To start off I ran it on my laptop(i5/Ubuntu 12.04/JDK7) on loopback, the result was:

Same core: mean=8644.23, 50%=9000, 99%=16000, 99.99%=24000
Cross cores: mean=5809.40, 50%=6000, 99%=9000, 99.99%=23000

Sending and receiving data over loopback is CPU intensive, which is why putting the client and the server on the same core is not a good idea. I went on to run the same on a beefy test environment, which has 2 test machines with tons of power to spare, and a choice of NICs connecting them together directly. The test machine is a dual socket beast so I took the opportunity to run on loopback across sockets:

Cross sockets: mean=12393.97, 50%=13000, 99%=16000, 99.99%=29000
Same socket, same core: mean=11976.68, 50%=12000, 99%=16000, 99.99%=28000
Same socket, cross core: mean=7663.82, 50%=8000, 99%=11000, 99.99%=23000

Testing the connectivity across the network between the 2 machines I compared 2 different 10Gb cards and a 1Gb card available on that setup. I won't mention make and model as this is not a vendor shootout:

10Gb A: mean=19746.08, 50%=18000, 99%=26000, 99.99%=38000
10Gb B: mean=30099.29, 50%=30000, 99%=33000, 99.99%=44000
1Gb C: mean=83022.32, 50%=83000, 99%=87000, 99.99%=95000

The above variations in performance are probably familiar to those who do any amount of benchmarking, but may come as a slight shock to those who don't. This is exactly what people mean when they say your mileage may vary :). And this is without checking for further variation by JDK version/vendor, OS etc. There will be variation in the performance depending on all these factors, which is why a baseline figure taken from your own environment can provide a useful estimation tool for performance on the same hardware. The above also demonstrates the importance of process affinity when considering latency.

Conclusion

An average RTT latency of 20 microseconds between machines is pretty nice. You can do better by employing better hardware and drivers (kernel bypass), and you can make your outliers disappear by fine tuning JVM options and the OS. At its core Java networking is pretty darn quick, make sure you squeeze all you can out of it. But to do that, you'll need a baseline figure to let you know when you can stop squeezing, and when there's room for improvement.
UPDATE(4/07/2014): I forgot to link this post to its next chapter where we explore the relative performance of different flavours of the same benchmark using select()/selectNow()/blocking channels/memory mapped files as the ping transport, all nicely packaged for you to play with ;-).

Thursday, 13 December 2012

Atomic*.lazySet is a performance win for single writers

Summary: For programs respecting the Single Writer principle the Atomic*.lazySet method and its underlying Unsafe.putOrdered* intrinsic present a performance win in the form of significantly cheaper volatile writes.
[UPDATE 20/12/2016: here's a more recent definition and references on what lazySet/putOrdered is]
A few months ago I attended Martin Thompson's excellent Lock Free Algorithms course. The course walks through some familiar territory for those who have been reading his blog and read through the disruptor, and lots of other goodies which are not. Most of all, the dude himself is both amazingly knowledgeable on all things concurrent, and a clear presenter/teacher on a topic that is confusing and often misunderstood. One of the facilities we utilized during that course, and one that is present under the covers of the disruptor, was lazySet/putOrdered. It was only after the course that I wondered what that bit of magic is and how/why it works. Having talked it over briefly with Martin, and having dug up the treasures of the internet, I thought I'd share my findings to highlight the utility of this method.

The origins of lazySet

"In the beginning there was Doug"

And Doug said: "This is a niche method that is sometimes useful when fine-tuning code using non-blocking data structures. The semantics are that the write is guaranteed not to be re-ordered with any previous write, but may be reordered with subsequent operations (or equivalently, might not be visible to other threads) until some other volatile write or synchronizing action occurs)." - Doug Lea is one of the main people behind Java concurrency and the JMM and the man behind the java.util.concurrent package. Carefully reading his definition of lazySet it is not clear that it guarantees much at all in and of itself.
The description of where it might prove useful is also not that encouraging: "The main use case is for nulling out fields of nodes in non-blocking data structures solely for the sake of avoiding long-term garbage retention" - Which implies that the implementers of lazySet are free to delay the set indefinitely. Nulling out values you don't care about particularly in terms of visibility does not sound like such a hot feature.
The good bit is however saved for last: "lazySet provides a preceeding store-store barrier (which is either a no-op or very cheap on current platforms), but no store-load barrier" - Let's refresh our memory from Doug's cookbook (no muffins there :-(, but lots of crunchy nuggets of wisdom):

StoreStore Barriers: The sequence: Store1; StoreStore; Store2 ensures that Store1's data are visible to other processors (i.e.,flushed to memory) before the data associated with Store2 and all subsequent store instructions. In general, StoreStore barriers are needed on processors that do not otherwise guarantee strict ordering of flushes from write buffers and/or caches to other processors or main memory.

StoreLoad Barriers: The sequence: Store1; StoreLoad; Load2 ensures that Store1's data are made visible to other processors (i.e., flushed to main memory) before data accessed by Load2 and all subsequent load instructions are loaded. StoreLoad barriers protect against a subsequent load incorrectly using Store1's data value rather than that from a more recent store to the same location performed by a different processor. Because of this, on the processors discussed below, a StoreLoad is strictly necessary only for separating stores from subsequent loads of the same location(s) as were stored before the barrier. StoreLoad barriers are needed on nearly all recent multiprocessors, and are usually the most expensive kind. Part of the reason they are expensive is that they must disable mechanisms that ordinarily bypass cache to satisfy loads from write-buffers. This might be implemented by letting the buffer fully flush, among other possible stalls.

We all like cheap (love no-op) and hate expensive when it comes to performance, so we would all like lazySet to be as good as a volatile set, just a lot cheaper. A volatile set would require a StoreLoad barrier, which is expensive because it has to make the data available to everyone before we get on with our tasks, and get the latest data in case someone else changed it. This is implicit in the line "protect against a subsequent load incorrectly using Store1's data value rather than that from a more recent store to the same location". But if there is only a single writer we don't need to do that, as we know no one will ever change the data but us.
And from that follows that strictly speaking lazySet is at the very least as correct as a volatile set for a single writer.
At this point the question is when (if at all) will the value set be made visible to other threads.

"Dear Doug"

The Concurrency Interest mailing list is an excellent source of informal Q&A with the Java concurrency community and the question I ask above has been answered there by Doug:

1) Will lazySet write actually happens in some finite time?
The most you can say from the spec is that it will be written no later than at the point that the process must write anything else in the Synchronization Order, if such a point exists. However, independently of the spec, we know that so long as any process makes progress, only a finite number of writes can be delayed. So, yes.
2) If it happens (== we see spin-wait loop finished) -- does it mean,that all writes preceeding lazySet are also done, commited, and visible to thread 2, which finished spin-wait loop?
Yes, although technically, you cannot show this by reference to the Synchronization Order in the current JLS.
...
lazySet basically has the properties of a TSO store

To give credit where credit is due, the man who asked the question is Ruslan Cheremin, and if you can read Russian (or whatever Google Translate makes of it) you can see he was similarly curious about the guarantees provided by lazySet. His inquiry, and the bread crumbs it left, made my job much easier.
Now that we've established lazySet definitely should work, and that Doug promises us an almost free volatile write for single writers, all we need to quantify is how lazy lazySet is exactly. In his reply Doug suggests the publication is conditional on further writes being made, somehow causing the CPU to flush the store queue at some unknown point in the future. This is not good news if we care about predictable latency.
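The second answer in the quote above is the one publication relies on: once a reader sees the value written by lazySet, it also sees the ordinary writes that preceded it. A minimal sketch of that pattern follows, assuming the standard `AtomicLong.lazySet`; the class, field and method names are hypothetical and this is not code from the experiments.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical names; a sketch of the guarantee discussed above.
class LazyPublish {
    // Plain (non-volatile) payload, written before the ordered store.
    static long payload;
    // Flag with a single writer, published with lazySet.
    static final AtomicLong ready = new AtomicLong(0);

    static long publishAndRead(long value) {
        payload = 0;
        ready.set(0);
        final long[] seen = new long[1];
        Thread reader = new Thread(() -> {
            // Volatile read: spin until the flag written by lazySet shows up.
            while (ready.get() == 0) { }
            // The payload write preceded the lazySet, so it is visible here.
            seen[0] = payload;
        });
        reader.start();
        payload = value;   // ordinary write
        ready.lazySet(1);  // ordered write, no StoreLoad barrier; nothing is written after it
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(publishAndRead(42L));
    }
}
```

Note that the writer makes no further writes after the lazySet, which is exactly the case the latency question is about.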

Lazy, Set, Go!

To demonstrate that lazySet is in fact fine in terms of latency, and to further demonstrate that it is a big win for single writers, I put together some experiments. I wanted to demonstrate the low level mechanics behind lock free, wait free (no synchronized blocks/locks/wait/notify allowed) inter-thread communications. To do so I re-implemented AtomicLong, trimmed to size, as a VolatileLong, because we don't need the atomicity offered on set and I also wanted to add a direct (as in not ordered or volatile) setter of the value. The full code is in the GitHub repository listed in the references.
I hid the actual choice of setter by creating a Counter interface with a get and a set method. The get always used the volatile getter, as using the direct one results in the values never being propagated. The direct getter is included for completeness. The experiments were run with same core and cross core affinity.
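To make the three flavours of write concrete, here is a minimal sketch of what such a VolatileLong and Counter could look like. It is an assumption-laden stand-in rather than the original code: it uses `VarHandle` (Java 9+) where the original may well have used other means, and the method and class names other than VolatileLong and Counter are made up for illustration.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Counter hides which kind of write is used; get is always a volatile read.
interface Counter {
    long get();
    void set(long value);
}

// Sketch only: VarHandle is an assumption, not necessarily what the post used.
final class VolatileLong {
    private static final VarHandle VALUE;
    static {
        try {
            VALUE = MethodHandles.lookup()
                    .findVarHandle(VolatileLong.class, "value", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private volatile long value;

    long get() { return value; }                         // volatile read
    void volatileSet(long v) { value = v; }              // volatile write
    void lazySet(long v) { VALUE.setRelease(this, v); }  // ordered write, same effect as AtomicLong.lazySet
    void directSet(long v) { VALUE.set(this, v); }       // plain write, no ordering or visibility guarantee
}

final class VolatileCounter implements Counter {
    private final VolatileLong v = new VolatileLong();
    public long get() { return v.get(); }
    public void set(long value) { v.volatileSet(value); }
}

final class LazyCounter implements Counter {
    private final VolatileLong v = new VolatileLong();
    public long get() { return v.get(); }
    public void set(long value) { v.lazySet(value); }
}

final class DirectCounter implements Counter {
    private final VolatileLong v = new VolatileLong();
    public long get() { return v.get(); }
    public void set(long value) { v.directSet(value); }
}
```

With this shape an experiment only needs a Counter, and swapping the write strategy means swapping the implementation.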

Ping Pong - demonstrate lazySet has good latency characteristics

We have 2 threads that need to keep pace with each other, such that one informs the other of its current long value and waits for the other to confirm it got it before incrementing that same value and repeating. In the real world there is a better (as in faster) way to implement this particular requirement of keeping 2 threads in step, but as we are looking at single writers we will make a counter for each thread and maintain the single writer principle. The full code is in the GitHub repository listed in the references.
Note that this is a rather contrived behaviour in a lock free wait free program as the threads spin wait on each other, but as a way of measuring latency it works. It also demonstrates that even though no further writes are made after lazySet the value still 'escapes' as required.
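The exchange described above can be sketched roughly as follows. This is a simplified stand-in rather than the experiment's code: it assumes `AtomicLong.lazySet` as the write under test, the calling thread plays the ping side, and the names `PingPong`, `run`, `ping` and `pong` are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the ping pong exchange, one counter per writer.
class PingPong {
    // Runs the exchange and returns the last value acknowledged by the pong thread.
    static long run(final long rounds) {
        final AtomicLong ping = new AtomicLong(0); // written only by the calling thread
        final AtomicLong pong = new AtomicLong(0); // written only by the pong thread
        Thread ponger = new Thread(() -> {
            for (long i = 1; i <= rounds; i++) {
                while (ping.get() != i) { }  // spin until the ping value arrives
                pong.lazySet(i);             // confirm it, no further writes follow
            }
        });
        ponger.start();
        long start = System.nanoTime();
        for (long i = 1; i <= rounds; i++) {
            ping.lazySet(i);                 // inform the other thread
            while (pong.get() != i) { }      // spin until it confirms
        }
        long elapsed = System.nanoTime() - start;
        try {
            ponger.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        System.out.println(rounds + " round trips in " + elapsed + "ns");
        return pong.get();
    }

    public static void main(String[] args) {
        run(1_000_000L);
    }
}
```

Each thread writes only its own counter, so the single writer principle holds on both sides of the exchange.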

Catchup - demonstrate lazySet cost for single writer

One thread is using a counter to mark inserted values into an array. The other thread is reading the value of this counter and scans through the array until it catches up with the counter. This is a typical producer consumer relationship and the low cost of our lazy write is supposed to shine here by not imposing the cost of a full StoreLoad barrier on the writing thread. Note that in this experiment there is no return call from the consumer to the producer. The full code is in the GitHub repository listed in the references.
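As a rough illustration of that producer consumer relationship, here is a minimal sketch. It is not the original listing: it assumes `AtomicLong.lazySet` as the ordered write, and the names `Catchup`, `run`, `data` and `written` are made up for this example.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the catchup experiment.
class Catchup {
    // The producer fills the array with 1..size; returns the sum the consumer observed.
    static long run(final int size) {
        final long[] data = new long[size];
        final AtomicLong written = new AtomicLong(0); // number of published slots, single writer
        final long[] sum = new long[1];
        Thread consumer = new Thread(() -> {
            int read = 0;
            long total = 0;
            while (read < size) {
                long limit = written.get();  // volatile read of the write marker
                while (read < limit) {       // catch up with the producer
                    total += data[read++];
                }
            }
            sum[0] = total;
        });
        consumer.start();
        for (int i = 0; i < size; i++) {
            data[i] = i + 1;         // plain write of the value
            written.lazySet(i + 1);  // ordered write of the marker, no StoreLoad barrier
        }
        try {
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum[0];
    }

    public static void main(String[] args) {
        System.out.println(run(1_000_000));
    }
}
```

The consumer only ever reads slots below the marker it observed, so every value it sums was written before the corresponding lazySet.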

Results and analysis

Sorry for the slightly lame chart, I'll explain:

Labels go Experiment -> set method -> affinity, e.g. "Catchup direct same" is the result of running the Catchup experiment using the direct set with affinity set up such that both threads run on the same core.
Yellow bar is maximum run time, orange is median, blue is minimum.

As we can see for the Ping Pong experiment there is practically no difference between the different methods of writing. Latency is fairly stable, although volatile performs slightly better in that respect. The Catchup experiment demonstrates the fact that volatile writes are indeed significantly more expensive (5-6 times) than the alternatives.
The curious guest at this party is the direct write. It shouldn't really work at all, and yet not only does it work, it also seems to perform better than lazySet/putOrdered. How come? I'm not sure. It certainly isn't a recommended path to follow, and I have had variations of the experiments hang when using the direct set. The risk here is that we are completely at the mercy of the JIT compiler: we rely on it not being clever enough to realize that our set can legally be done to a register rather than a memory location. We also have no guarantees regarding re-ordering, so using it as a write marker, as done in Catchup, is quite likely to break in more complex environments or under closer inspection. It may be worthwhile when no happens-before guarantee is required for prior writes, e.g. for publishing thread metrics or other such independent values, but it is an optimization that might backfire at any time.
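For the thread metrics case, a direct write might look like the sketch below. It assumes `AtomicLong.setPlain` and `getPlain` (Java 9+) as the direct accessors, the class and method names are hypothetical, and nothing in the memory model promises when, or whether, the reader will observe an update.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical metrics holder: one writer thread, any number of monitoring readers.
class WorkerMetrics {
    private final AtomicLong processed = new AtomicLong();

    // Called only by the owning worker thread: plain read, plain write.
    void onMessage() {
        processed.setPlain(processed.getPlain() + 1);
    }

    // Called by a monitoring thread: volatile read, the value may lag behind.
    long processedSoFar() {
        return processed.get();
    }

    public static void main(String[] args) {
        WorkerMetrics metrics = new WorkerMetrics();
        metrics.onMessage();
        System.out.println(metrics.processedSoFar());
    }
}
```

The counter is an independent value, so a stale read costs nothing more than a slightly out of date number.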

Summary:

The lazySet/putOrdered as a means of providing a happens before edge to memory writes is one of the building blocks of the Disruptor and other frameworks. It is a useful and legitimate method of publication and can provide measurable performance improvements as demonstrated.

Further thoughts on related topics...

As part of the data collection for this article I also looked at padded variations of the volatile long used to defend against false sharing and implemented a few variations of those. I went on to implement the same padded and un-padded variations as off heap structures and compared the performance of each, hitting some interesting issues along the way. In the end I decided it is best to keep this post focused and put the next step along into another post. The code is however available for reading on GitHub, and should you find it interesting I'm happy to discuss.

References:

The code can be found here: https://github.com/nitsanw/psy-lob-saw
Concurrency Interest mailing list
Single Writer principle article on Martin's blog
Volatile keyword explained
Memory Barriers in Java
JMM Implementation Cookbook
Broader discussion of memory barriers
Cost of volatile reads

Wednesday, 5 December 2012

Experimentation Notes: on herding processes and CPUs

Summary: notes on cpu affinity and c-state management from recent benchmarking efforts.
Your OS assumes your machine is used for a mix of activities, all of which have roughly the same priority and all of which must stand in line to get at limited resources. This is mostly terrific and lets us multi-task to our inner ADHD child's content. There are times, however, when you might want to exercise some control over who goes where and uses what, so here are some notes to assist with this task. This is not a proper way to do this (it will all go away on restart, for one) and I'm not much of a UNIX guru; this is rather an informal cheat sheet to help you get where you are going.

IRQ Balance

Stop the irqbalance daemon from spreading interrupts fairly across your CPUs:

service irqbalance stop

CPU power saving - cpufreq

Your OS is trying to be green and put your CPUs to sleep when it thinks you are not using them. Good for some but not when you are in a hurry:

echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

When you are done set it back to scaling the frequency on demand:

echo ondemand | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Pin processes to CPU cores

To put all your processes on a particular CPU list (core 0 in this example):

for i in $(ps -eo pid=) ; do sudo taskset -pc 0 $i ; done

When you are done you can let them roam again (0-3 here assumes a 4 core machine):

for i in $(ps -eo pid=) ; do sudo taskset -pc 0-3 $i ; done

This is useful when benchmarking: everybody moves to one core and you taskset your benchmarking process onto the cores left. Note that some processes may refuse to move. If you are in a NUMA environment you might have to use numactl instead.
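If the benchmark is started from a Java harness, the same taskset trick can be applied when launching the benchmark process. The sketch below assumes a Linux machine with taskset on the PATH; the class and method names are hypothetical, and the core list and child command are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical launcher: starts a child process restricted to a CPU list.
class PinnedLauncher {
    // Builds "taskset -c <cpuList> <command...>".
    static List<String> pinnedCommand(String cpuList, String... command) {
        List<String> full = new ArrayList<>(Arrays.asList("taskset", "-c", cpuList));
        full.addAll(Arrays.asList(command));
        return full;
    }

    // Starts the pinned child, waits for it and returns its exit code.
    static int launch(String cpuList, String... command) throws Exception {
        Process child = new ProcessBuilder(pinnedCommand(cpuList, command))
                .inheritIO()
                .start();
        return child.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder command: run "java -version" on cores 1-3.
        System.exit(launch("1-3", "java", "-version"));
    }
}
```

Starting the process under taskset means every thread it creates inherits the affinity, which is simpler than moving an already running JVM.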