Thursday, 13 December 2012

Atomic*.lazySet is a performance win for single writers

Summary: For programs respecting the Single Writer principle, the Atomic*.lazySet method and its underlying Unsafe.putOrdered* intrinsic offer a performance win in the form of significantly cheaper volatile writes.
[UPDATE 20/12/2016: here's a more recent definition and references on what lazySet/putOrdered is]
A few months ago I attended Martin Thompson's excellent Lock Free Algorithms course. The course walks through some territory familiar to those who have been reading his blog and have read through the Disruptor, and lots of other goodies which are not. Most of all, the man himself is both amazingly knowledgeable on all things concurrent and a clear presenter/teacher on a topic that is confusing and often misunderstood. One of the facilities we utilized during that course, and one that is present under the covers of the Disruptor, was lazySet/putOrdered. It was only after the course that I wondered what that bit of magic is and how/why it works. Having talked it over briefly with Martin, and having dug up the treasures of the internet, I thought I'd share my findings to highlight the utility of this method.

The origins of lazySet

"In the beginning there was Doug"

And Doug said: "This is a niche method that is sometimes useful when fine-tuning code using non-blocking data structures. The semantics are that the write is guaranteed not to be re-ordered with any previous write, but may be reordered with subsequent operations (or equivalently, might not be visible to other threads) until some other volatile write or synchronizing action occurs)." - Doug Lea is one of the main people behind Java concurrency and the JMM, and the man behind the java.util.concurrent package. Carefully reading his definition of lazySet, it is not clear that it guarantees much at all in and of itself.
The description of where it might prove useful is also not that encouraging: "The main use case is for nulling out fields of nodes in non-blocking data structures solely for the sake of avoiding long-term garbage retention" - which implies that the implementers of lazySet are free to delay the set indefinitely. Nulling out values whose visibility you don't particularly care about does not sound like such a hot feature.
The good bit is, however, saved for last: "lazySet provides a preceding store-store barrier (which is either a no-op or very cheap on current platforms), but no store-load barrier" - Let's refresh our memory from Doug's cookbook (no muffins there :-(, but lots of crunchy nuggets of wisdom):
StoreStore Barriers: The sequence: Store1; StoreStore; Store2 ensures that Store1's data are visible to other processors (i.e., flushed to memory) before the data associated with Store2 and all subsequent store instructions. In general, StoreStore barriers are needed on processors that do not otherwise guarantee strict ordering of flushes from write buffers and/or caches to other processors or main memory.

StoreLoad Barriers: The sequence: Store1; StoreLoad; Load2 ensures that Store1's data are made visible to other processors (i.e., flushed to main memory) before data accessed by Load2 and all subsequent load instructions are loaded. StoreLoad barriers protect against a subsequent load incorrectly using Store1's data value rather than that from a more recent store to the same location performed by a different processor. Because of this, on the processors discussed below, a StoreLoad is strictly necessary only for separating stores from subsequent loads of the same location(s) as were stored before the barrier. StoreLoad barriers are needed on nearly all recent multiprocessors, and are usually the most expensive kind. Part of the reason they are expensive is that they must disable mechanisms that ordinarily bypass cache to satisfy loads from write-buffers. This might be implemented by letting the buffer fully flush, among other possible stalls.
We all like cheap (we love no-op) and hate expensive when it comes to performance, so we would all like lazySet to be as good as a volatile set, just a lot cheaper. A volatile set would require a StoreLoad barrier, which is expensive because it has to make the data available to everyone before we get on with our tasks, and get the latest data in case someone else changed it. This is implicit in the line "protect against a subsequent load incorrectly using Store1's data value rather than that from a more recent store to the same location". But if there is only a single writer we don't need to do that, as we know no one will ever change the data but us.
And from that it follows that, strictly speaking, lazySet is at the very least as correct as a volatile set for a single writer.
At this point the question is when (if at all) will the value set be made visible to other threads.

"Dear Doug"

The Concurrency Interest mailing list is an excellent source of informal Q&A with the Java concurrency community, and the question I ask above has been answered there by Doug himself:
1) Will lazySet write actually happens in some finite time?
The most you can say from the spec is that it will be written no later than at the point that the process must write anything else in the Synchronization Order, if such a point exists. However, independently of the spec, we know that so long as any process makes progress, only a finite number of writes can be delayed. So, yes.
2) If it happens (== we see spin-wait loop finished) -- does it mean,that all writes preceeding lazySet are also done, commited, and visible to thread 2, which finished spin-wait loop?
Yes, although technically, you cannot show this by reference to the Synchronization Order in the current JLS.
...
lazySet basically has the properties of a TSO store
To give credit where credit is due, the man who asked the question is Ruslan Cheremin. If you can read Russian (or what Google Translate makes of it), you can see he was similarly curious about the guarantees provided by lazySet; his inquiry and the breadcrumbs it left made my job much easier.
Now that we've established that lazySet definitely should work, and that Doug promises us an almost-free volatile write for single writers, all we need to quantify is how lazy lazySet is exactly. In Doug's reply he suggests the publication is conditional on further writes being made, somehow causing the CPU to flush the store buffer at some unknown point in the future. This is not good news if we care about predictable latency.

Lazy, Set, Go!

To demonstrate that lazySet is in fact fine in terms of latency, and to further demonstrate that it is a big win for single writers, I put together some experiments. I wanted to demonstrate the low-level mechanics behind lock-free, wait-free (no synchronized blocks/locks/wait/notify allowed) inter-thread communication, so I re-implemented/trimmed AtomicLong down to a VolatileLong, because we don't need the atomicity offered on set, and I also wanted to add a direct (as in neither ordered nor volatile) setter of the value (the full code is here):
I hid the actual choice of setter by creating a Counter interface with a get and a set method. The get always uses the volatile getter, as using the direct one results in the values never being propagated; it's included for completeness. The experiments were run with same-core and cross-core affinity.
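The original experiment used Unsafe directly; here is a minimal sketch of the same idea using AtomicLongFieldUpdater instead, where the names Counter, VolatileLong, setVolatile and setOrdered are my reconstruction, not the original code:

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

// A sketch, not the original code: the names below are assumed.
interface Counter {
    long get();
    void set(long value);
}

class VolatileLong {
    static final AtomicLongFieldUpdater<VolatileLong> UPDATER =
            AtomicLongFieldUpdater.newUpdater(VolatileLong.class, "value");
    private volatile long value;

    long get()               { return value; }             // volatile load
    void setVolatile(long v) { value = v; }                // volatile store: implies a StoreLoad barrier
    void setOrdered(long v)  { UPDATER.lazySet(this, v); } // ordered store: StoreStore barrier only
    // A truly direct (plain) store requires Unsafe.putLong (or VarHandle
    // set/setPlain on modern JDKs) and is omitted from this sketch, as a
    // volatile field offers no plain-write path in pure Java.
}
```

Functionally all setters publish the same value; the difference is purely in the ordering guarantees and their cost.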

Ping Pong - demonstrate lazySet has good latency characteristics

We have 2 threads which need to keep pace with each other, such that one informs the other of its current long value and waits for the other to confirm it got it before incrementing that same value and repeating. In the real world there is a better (as in faster) way to implement this particular requirement of keeping 2 threads in step, but as we are looking at single writers we will give each thread its own counter and maintain the Single Writer principle. The full code is here, but the core is:
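A rough sketch of that core, assuming AtomicLong in place of the post's VolatileLong (so the field and method names here are mine, not the original's):

```java
import java.util.concurrent.atomic.AtomicLong;

// Ping-pong sketch: each counter has exactly one writer.
// The pinger writes ping, the ponger writes pong.
public class PingPong {
    static final AtomicLong ping = new AtomicLong(-1);
    static final AtomicLong pong = new AtomicLong(-1);
    static final int ROUNDS = 10_000;

    public static void main(String[] args) throws Exception {
        Thread ponger = new Thread(() -> {
            for (long v = 0; v < ROUNDS; v++) {
                while (ping.get() != v) { }  // spin until the pinger publishes v
                pong.lazySet(v);             // echo it back with an ordered store
            }
        });
        ponger.start();
        long start = System.nanoTime();
        for (long v = 0; v < ROUNDS; v++) {
            ping.lazySet(v);                 // publish v (single writer to ping)
            while (pong.get() != v) { }      // spin until the echo arrives
        }
        long took = System.nanoTime() - start;
        ponger.join();
        System.out.println("ns/round-trip: " + took / ROUNDS);
    }
}
```

Swapping the lazySet calls for plain volatile sets (or direct sets, at your peril) gives the other variants of the experiment.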
Note that this is rather contrived behaviour in a lock-free, wait-free program, as the threads spin-wait on each other, but as a way of measuring latency it works. It also demonstrates that even though no further writes are made after lazySet, the value still 'escapes' as required.

Catchup - demonstrate lazySet cost for single writer


One thread uses a counter to mark values inserted into an array. The other thread reads the value of this counter and scans through the array until it catches up with the counter. This is a typical producer/consumer relationship, and the low cost of our lazy write should shine here by not imposing the cost of a full StoreLoad barrier on the writing thread. Note that in this experiment there is no return call from the consumer to the producer. The full code is here, but the core is:
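A sketch of that core, again with assumed names and AtomicLong standing in for the post's VolatileLong: the producer's array writes happen before the ordered counter write, so the consumer may safely read every slot below the counter it observes.

```java
import java.util.concurrent.atomic.AtomicLong;

// Catchup sketch: single-writer counter published with lazySet.
public class Catchup {
    static final int SIZE = 1_000_000;
    static final long[] data = new long[SIZE];
    static final AtomicLong counter = new AtomicLong(0); // next index to be written
    static volatile long consumedSum = 0;

    public static void main(String[] args) throws Exception {
        Thread consumer = new Thread(() -> {
            long sum = 0;
            int read = 0;
            while (read < SIZE) {
                long available = counter.get();  // volatile load of producer progress
                while (read < available) {
                    sum += data[read++];         // safe: stores before lazySet are visible
                }
            }
            consumedSum = sum;
        });
        consumer.start();
        for (int i = 0; i < SIZE; i++) {
            data[i] = i;            // plain store into the array
            counter.lazySet(i + 1); // ordered store: the array write cannot pass it
        }
        consumer.join();
    }
}
```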

Results and analysis


Sorry for the slightly lame chart, I'll explain:
  • Labels go Experiment -> set method -> affinity, i.e. Catchup direct same means it's the result for running the Catchup experiment using direct set, with affinity set up such that both threads are running on the same core.
  • Yellow bar is maximum run time, orange is median, blue is minimum.
As we can see, for the Ping Pong experiment there is practically no difference between the different methods of writing. Latency is fairly stable, although volatile performs slightly better in that respect. The Catchup experiment demonstrates that volatile writes are indeed significantly more expensive (5-6 times) than the alternatives.
The curious guest at this party is the direct write. It shouldn't really work at all, and yet not only does it work, it also seems to perform better than lazySet/putOrdered. How come? I'm not sure. It certainly isn't a recommended path to follow, and I have had variations of the experiments hang when using the direct set. The risk here is that we are completely at the mercy of the JIT compiler's cleverness not realizing that our set can legally be done to a register rather than a memory location. We also have no guarantees regarding re-ordering, so using it as a write marker as done in Catchup is quite likely to break in more complex environments or under closer inspection. It may be worthwhile when no happens-before guarantee is required for prior writes, i.e. for publishing thread metrics or other such independent values, but it is an optimization that might backfire at any time.

Summary:

lazySet/putOrdered, as a means of providing a happens-before edge to memory writes, is one of the building blocks of the Disruptor and other frameworks. It is a useful and legitimate method of publication and can provide measurable performance improvements, as demonstrated.

Further thoughts on related topics...

As part of the data collection for this article I also looked at padded variations of the volatile long used to defend against false sharing and implemented a few variations of those. I went on to implement the same padded and un-padded variations as off heap structures and compared the performance of each, hitting some interesting issues along the way. In the end I decided it is best to keep this post focused and put the next step along into another post, the code is however available for reading on Github and should you find it interesting I'm happy to discuss.


9 comments:

  1. Nice article. I also wrote about the benefits of lazySet for asynchronous logging here: Asynchronous logging versus Memory Mapped Files

    In my article I compare set, lazySet and memory-mapped files when it comes to the latency faced by the producer.

  2. If lazySet is a StoreStore barrier and get() is a LoadLoad, is it possible that the code

    a: ping.lazySet(1)
    b: while (pong.get() != 1);

    gets swapped and thus ping never being set to 1? (On X86, the spec says "Reads may be reordered with older writes to different locations but not with older writes to the same location.")

    1. This is a lawyer's game, so in theory this is possible. But the x86 spec is not the issue here, it's the JMM.
      The rules for the barriers as specified in the cookbook leave this open to interpretation, but I believe this is not allowed, because it would indefinitely delay the store, and such a re-ordering would break sequential consistency. This is a good question to pop on the concurrency-interest mailing list.

  3. "a StoreLoad is strictly necessary only for separating stores from subsequent loads of the same location(s) as were stored before the barrier"

    The above sentence is at least misleading. The StoreLoad barrier is also *necessary* to prevent reordering of younger loads ahead of older stores to different locations for several algorithms to work correctly.

    For instance, http://preshing.com/20120515/memory-reordering-caught-in-the-act. In the example given in the article, a store-load barrier is required between 'store to x' and 'load from y', where x and y are different memory locations for the program to behave in a sequential consistent manner. You can't solve it by placing any combinations of the other three barriers.

    Also, you need store-load for making the famous Dekker algorithm for mutual exclusion work, where the loads and stores that are separated by the store-load barriers are done to different memory locations.

    1. The quote is not mine, it is Doug Lea's: http://gee.cs.oswego.edu/dl/jmm/cookbook.html
      The cookbook is the compiler writer's guide to the JMM and as such considered a definitive source on the JMM implementation details. That is not to say it is beyond error, and it is certainly not free of awkward and misleading sentences.

  4. Hi Nitsan,

    Very good article. I have a question about ordered write but non-volatile reads.

    For example, I have an integer variable, lets call it `x`. It is defined as non-volatile field.
    Writes to that field are done as ordered writes through `putOrdered`, increasing it sequentially.
    However, reads are normal reads without any memory barrier.

    My question is that is it guaranteed that reads see the values of `x` as ordered?
    I mean, assume that there is a code like this:
    ```
    int val1 = unsafe.getInt(addressOfX);
    int val2 = unsafe.getInt(addressOfX);
    ```
    Is it possible that reads are reordered by CPU/compiler and `val1` is greater than `val2`?
    I mean reads are reordered so `val1` is the next value and `val2` is the previous value of `x` even though in the code, `val2` is set after `val1`.


    Thanks in advance.

    1. It's been a long time since this post was published, so I'll answer, but I'm not going to re-read for context. Please point out any inconsistencies.

      Non-volatile reads can be reordered, eliminated, and hoisted. How they get reordered is entirely dependent on the compiler. At this point your question becomes about whether or not it is possible for the 2 reads to be reordered such that the val1 load is moved after the val2 load, and the answer is yes IMO.
      If the 2 reads are as you describe, I think it's probable that the value is only read once (so val1 == val2 in all cases).
      If there's code between the loads, that code can be rearranged such that val2 will sometimes happen before val1 (consider unifying code blocks for the same condition, for instance).
      Normal reads from the thread doing the writing are fine, assuming it's a single writer, but for other threads there's no guarantee.

