Monday, 7 October 2013

SPSC revisited part III - FastFlow + Sparse Data

After I published the last post on this topic (if you want to catchup on the whole series see references below) I got an interesting comment from Pressenna Sockalingasamy who is busy porting the FastFlow library to Java. He shared his current implementation for porting the FastFlow SPSC queue (called FFBuffer) and pointed me at a great paper on it's implementation. The results Pressenna was quoting were very impressive, but when I had a close look at the code I found what I believed to be a concurrency issue, namely there were no memory barriers employed in the Java port. Our exchange is in the comments of the previous post. My thoughts on the significance of memory barriers, and why you can't just not put them in are recorded here.
Apart from omitting the memory barriers the FFBuffer offered two very interesting optimizations:
  1. Simplified offer/poll logic by using the data in the queue as an indicator. This removes the need for tracking the relative head/tail position and using those to detect full/empty conditions.
  2. Sparse use of the queue to reduce contention when the queue is near full or near empty (see musings below). This bloats the queue, but guarantees there are less elements per cache line (this one is an improvement put in by Pressenna himself and not part of the original)
So after getting off my "Memory Barriers are important" high horse, I stopped yapping and got to work seeing what I can squeeze from these new found toys.

FFBuffer - Simplified Logic

The simplified logic is really quite compelling. If you take the time to read the paper, it describes the queue with offer/poll loop I've been using until now as Lamport's circular buffer (the original paper by Lamport is here, not an easy read), full code is here, but the gist is:

I've been using a modified version which I got from a set of examples put forth by Martin Thompson as part of his lock free algos presentation and evolved further as previously discussed(padding is omitted for brevity):

With the FFBuffer variant Pressenna sent my way we no longer need to do all that accounting around the wrap point (padding is omitted for brevity):

The new logic is simpler, has less moving parts (no head/tail caching) and in theory looks like cache coherency traffic should be reduced by only bringing in the array elements which are required in any case.

The Near Empty/Full Queue Problem

At some point while testing the different queues I figured the test harness itself may be to blame for the run to run variance I was seeing. The fun and games I had are a story to be told some other time, but one of the thing that emerged from these variations on the test harness was the conclusion that testing a near empty queue produces very different results. By reversing the thread relationship (Main is producer, other thread is consumer, code here) in the test harness I could see great performance gains, the difference being that the producer was getting a head start in the race, creating more slack to work with. This was happening because the consumer and the producer are experiencing contention and false sharing when the queue has 1 to 16 elements in it. The fact that the consumer is reading a cache row that is being changed is bad enough, but it is also changing the same line by setting the old element to null. The nulling out of elements as they are read is required to avoid memory leaks. Also note that when the consumer/producer run out of space to read/write as indicated by their cached view of the tail/head they will most likely experience a read miss with the implications discussed previously here.
Several solutions present themselves:
  • Batch publication: This is a problem for latency sensitive applications and would require the introduction of some timely flush mechanism to the producer. This is not an issue for throughput focused applications. Implementing this solution will involve the introduction of some batch array on the producer side which is 'flushed' into the queue when full. If the batch array is longer than a cache line the sharing issue is resolved. An alternative solution is to maintain a constant distance between producer and consumer by changing the expectation on the consumer side to stop reading at a given distance.
  • Consumer batch nulling: Delay the nulling out of read elements, this would reduce the contention but could also result in a memory leak if some elements are never nulled out. This also means the near empty queue is still subject to a high level of coherency traffic as the line mutates under the feet of the reader.
  • Padding the used elements: We've taken this approach thus far to avoid contention around fields, we can apply the same within the data buffer itself. This will require us to use only one in N cells in the array. We will be trading space for contention, again. This is the approach taken above and it proves to work well.

Good fences make good neighbours

As mentioned earlier, the code above has no memory barriers, which made it very fast but also incorrect in a couple of rather important ways. Having no memory barriers in place means this queue can:
  • Publish half constructed objects, which is generally not what people expect from a queue. In a way it means that notionally stuff can appear on the other side of the queue before you sent it (because the code is free to be reordered by the compiler). 
  • The code can be reordered such that the consumer ends up in an infinite busy loop (the read value from the array can be hoisted out of the reading loop).
I introduced the memory barriers by making the array read volatile and the array write lazy. Note that in the C variation only a Write Memory Barrier is used, I have implemented variations to that effect but while testing those Pressenna informed me they hang when reading/writing in a busy loop instead of yielding (more on this later). Here's the final version (padding is omitted for brevity):

There are a few subtleties explored here:
  • I'm using Unsafe for array access, this is nothing new and is a cut and paste job from the AtomicReferenceArray JDK class. This means I've opted out of the array bound checks we get when we use arrays in Java, but in this case it's fine since the ring buffer wrapping logic already assures correctness there. This is necessary in order to gain access to getVolatile/putOrdered.
  • I switched Pressanas original field padding method with mine, mostly for consistency but also it's a more stable method of padding fields (read more on memory layout here).
  • I doubled the padding to protect against pre-fetch caused false sharing (padding is omitted above, but have a look at the code).
  • I replaced the POW final field with a constant (ELEMENT_SHIFT). This proved surprisingly significant in this case. Final fields are not optimized as aggressively as constants, partly due to the exploited backdoor in Java allowing the modification of final fields (here's Cliff Click's rant on the topic). ELEMENT_SHIFT is the shift for the sparse data and the shift for elements in the array (inferred from Unsafe.arrayIndexScale(Object[].class)) combined.
How well did this work? Here is a chart comparing it to the the latest queue implementation I had (Y5 with/Y4 without Unsafe array access), with sparse data shift set to 0 and 2 (benchmarked on Ubuntu13.04/JDK7u40/i7-4700MQ@2.40GHz using JVM params: -XX:+UseCondCardMark -XX:CompileThreshold=100000 and pinned to run across cores, results from 30 runs are sorted to make the the charts more readable):
Y=Ops per second, X=benchmark run
Shazzam! This is a damn fine improvement note that the algorithm switch is far less significant than the use of sparse data.
To further examine the source of the gain I benchmarked a variant with the element shift parametrized and a further variant with single padding (same machine and settings as above, sparse shift is 2 for all):
Y=Ops per second, X=benchmark run
Single padding seems to make little difference but parametrization is quite significant.

Applying lessons learnt

Along the way, as I was refining my understanding of the FFBuffer optimizations made above I applied the same to the original implementation I had. Here are the incremental steps:

  1. Y4 - Starting point shown above.
  2. Y5 - Use Unsafe to access array (regression).
  3. Y6 - Use sparse elements in the buffer.
  4. Y7 - Double pad the fields/class/array.
  5. Y8 - Put head/tailCache on same line, put tail/headCache on another line.
  6. Y81 - From Y8, Parametrize sparse shift instead of constant (regression).
  7. Y82 - From Y8, Revert to using array instead of Unsafe access (regression).
  8. Y83 - From Y8, Revert double padding of fields/class/array (regression).

Lets start with the journey from Y4 to Y8:
Y=Ops per second, X=benchmark run
Note the end result is slightly better than the FF one, though slightly less stable. Here is a comparison between Y8 and it's regressions Y81 to Y83:
Y=Ops per second, X=benchmark run
So... what does that tell us?

  • The initial switch to using Unsafe for array access was a regression as a first step (Y5), but proves a major boost in the final version. This is the Y82 line. The step change in its performance I think is down to the JIT compiler lucking out on array boundary check elimination, but I didn't check the assembly to verify.
  • Parametrizing is an issue here as well, more so than for the FFBuffer implementation for some reason.
  • I expected Y83 (single padding) to be the same or worse than Y6. The shape is similar but it is in fact much better. I'm not sure why...
  • There was a significant hiccup in the field layout of Y4/5/6 resulting in the bump we see in going from Y6 to Y7 and the difference between Y83 and Y6. As of now I'm not entirely certain what is in play here. It's clear that pre-fetch is part of it, but I still feel I'm missing something.
So certainly not all is clear in this race to the top but the end result is a shiny one, let's take a closer look at the differences between Y8 and FFO3. Here's a comparison of these 2 with sparse data set to 0,1,2,3,4 running cross cores:
From these results it seems that apart from peaking for the s=2 value, Y8 is generally slower than FFO3. FFO3 seems to improve up to s=3, then drop again for s=4. To me this suggests FFO3 is less sensitive to the effect of the near empty queue as an algorithm.
Next I compared the relative performance of the 2 queue when running on the same core, this time only s=0, 2 where considered:
This demonstrates that FFO3 is superior when threads are running on the same core and still benefits from sparse data when running in that mode. Y8 is less successful and seem to actually degrade in this scenario when sparse data is used. Y8 offers a further optimization direction I have yet to fully explore in the form of batching consumer reads, I will get back to it in a further post.

On Testing And A Word Of Warning

I've been toying around with several test harnesses which are all on GitHub for you to enjoy. The scripts folder also allows running the tests in different modes (you may want to adjust the taskset values to match your architecture). Please have a play. Mostly the harnesses differ in the way they measure the scope of execution and which thread (producer/consumer) acts as the trigger for starting the test and so on. Importantly you will find there is the BusyQueuePerfTest which removes the Thread.yield() call when an unsuccessful offer/poll happens.
During the long period in which this post has been in the making (I was busy, wrote other posts... I have a day job... it took forever) Pressana has not been idle and got back to me with news that all the implementations which utilize Unsafe to access the array (Y5/SPSCQueue5 and later,  most of the FFBuffer implementations) were in fact broken when running the busy spin test and would regularly hang indefinitely. I tested locally and found I could not reproduce the issue at all for the SPSCQueue variations and could only reproduce it for the FFBuffer implementations which used no fences (the original) or only the STORE/STORE barrier (FFO1/2). Between us we have managed to verify that this is not an issue for JDK7u40 which had a great number of concurrency related bug fixes in it, but is an issue for previous versions! I strongly suggest you keep this in mind and test thoroughly if you intend to use this code in production.


Summary

This post has been written over a long period, please forgive any inconsistencies in style. It covers a long period of experimentation and is perhaps not detailed enough... please ask for clarifications in comments and I will do my best to  answer. Ultimately, the source is there for you to examine and play with and has the steps broken down such that a diff between step should be helpful.
This is also now part 5 in a series of long running posts, so if you missed any or forgotten it all by now here they are:
  1. Single Producer/Consumer lock free Queue step by step: This is covering the initial move from tradition JDK queues to the lock free variation used as a basis for most of the other posts
  2. 135 Million messages a second between processes in pure Java: Taking the queue presented in the first post and taking it offheap and demonstrating its use as an IPC transport via a memory mapped file
  3. Single Producer Single Consumer Queue Revisited: An empiricist tale - part I: Looking at run to run variation in performance in the original implementation and trying to cure it by inlining the counter fields into the queue class
  4. SPSC Revisited - part II: Floating vs Inlined Counters: Here I tried to access to impact of inlining beyond fixing up the initial implementations exposure to false sharing on one side.
And here we are :-), mostly done I think.
I would like to thank Pressana for his feedback, contribution and helpful discussion. He has a website with some of his stuff on, he is awesome! As mentioned above the idea for using sparse data is his own (quoting from an e-mail):
"All I tried was to cache align access to the elements in the array as it is done in plenty of c/c++ implementations. I myself use arrays of structs to ensure cache alignment in C. The same principle is applied here in Java where we don't have structs but only an array of pointer. "
Finally, the end result (FFO3/Y8) is incredibly sensitive to code changes as the difference between using a constant and field demonstrates, but also performs very consistently between runs. As such I feel there is little scope for improvement left there... so maybe now I can get on to some other topics ;-).

24 comments:

  1. Would using an AtomicReferenceArray (get / lazySet) instead of Unsafe work out to more-or-less the same performance characteristics?

    ReplyDelete
    Replies
    1. I'd expect it to behave like the array variation, probably worse... but the code is all there, you can try it out and tell me :-)

      Delete
  2. Hello Nitsan,

    Thanks a lot for the article. Meanwhile the last buffer implementation will not work correctly due to the ABA issue.

    If you've got multiply threads which can pull and offer values from your buffer you can face with the following scenario:
    For example:
    Thread A | Thread B | Thread C
    1. take offset for offer (1) | take offset for offer (1) |
    2. update array with a value | waiting for next time slice from the dispatcher |
    3. update tail index with 1 | - // - |
    4. returns true | - // - | take offset for poll (1)
    5. | - // - | update array with null
    6. | update array with a value | update head index (1)
    7. | update tail index with 1 | return A thread value
    8. | returns true |

    Then, you will not be able to get a value added by the Thread B till the index will not make full loop. And that is another problem, because the buffer will just return false on the offer call in this case.

    Thanks

    ReplyDelete
    Replies
    1. Hi Alexander,
      Thanks for reading, glad you liked it.
      All the queues discussed are single producer, single consumer (SPSC)... they will break in many ways if you tried to use them with multiple consumer/producer.
      Sorry if that is not clear in the post.

      Delete
    2. Oh, sorry that is my fault - I've lost the context. And you can also reduce memory consumption if you would replace index shifting (ELEMENT_SHIFT) and additional memory allocation with the bit reversal of the last byte of your indexes. You can define static array of the replacements and use it to reverse bits of the last index bytes.

      Thanks a lot

      Delete
    3. No worries, happens to me all the time :)
      "And you can also reduce memory consumption if you would replace index shifting (ELEMENT_SHIFT) and additional memory allocation with the bit reversal of the last byte of your indexes. You can define static array of the replacements and use it to reverse bits of the last index bytes." - Lost you there... can you give an example of what you mean?

      Delete
    4. Sure. For example you've got an array with 16 elements - the indexes will be 0(0000)-15(1111). And if you reverse its bit you will have the following indexes:
      0 (0000) => 0 (0000)
      1 (0001) => 8 (1000)
      2 (0010) => 4 (0100)
      3 (0011) => 12 (1100)
      ...
      14 (1110) => 7 (0111)
      15 (1111) => 15 (1111)

      As you can see this simple transformation can resolve the False Sharing problem because the distance between elements with near index is equal to a half of reversed range (in our case is 8, but for whole byte it will be 128). So with this simple technique you can resolve the False Sharing problem without any necessity of additional memory allocation.

      Delete
    5. This comment has been removed by the author.

      Delete
    6. And maybe you can speed up your last buffer version with CAS usage instead of two getObjectVolatile and putOrderedObject operations for offering - because the expected value will be null anyway.

      Delete
    7. Thanks for the example, I think I get it.
      I'm willing to try anything [almost :-) ] but:
      1. The bit reversal trick is neat but will make the CPU reads jump all over the place whereas here the read is effectively sequential and very predictable from the CPU memory fetching point of view. Will give it a go when I have a chance, but I'm sceptical.
      2. The last version of offer has 1 volatile read and 1 putOrdered. A volatile read is a MOV from memory that cannot be reordered by the compiler, so very cheap. putOrdered is a MOV to memory that cannot be reordered so also very cheap (unlike a volatile write which is a MOV + LOCK XADD). A CAS is LOCK XCMPCHG, similar to the volatile write in cost. I don't think you can get rid of the volatile read and it seems obvious to me that CAS will be more expensive than the lazySet in place currently.

      It will be a fair few days before I can come back to this, but please feel free to try your ideas out on the code (fork and play) and let me know if I'm provably, absolutely or even just slightly wrong :-)

      Delete
    8. Well, I know that the reversal causes jumps, meanwhile I suppose that it can be smoothed down by L2/L3 caches (in worst case it will require 8x256 = 2 KBytes shift). And you cannot avoid jumps anyway because the buffer keeps references to objects placed in heap.

      Regarding to the CAS vs lazySet, I assume that your implementation can be cheaper then the CAS and your implementation provides better response time from the publisher side. Meanwhile lazySet does not guarantee that the change will be populated as soon as it will be made. As a result your subscriber can fail to busy spin or be parked by a lock because of the value lack.

      In the best case the subscriber will lost just rest of the execution quantum (yield or parkNanos(0)), in worst case the value will be processed in the next time slice. Any of these cases are more expensive than CAS and can explain the performance degradation on the Y8 cases.

      So in terms of operation cost your implementation has better cost, meanwhile in terms of overall performance of the buffer usage the CAS and volatile set can be more effective or at least it will be more stable.

      Delete
    9. The references stored on the heap are a separate issue, nothing I can do about it. The net result of the reversal will need to be benchmarked, no point arguing about something we can measure. Thanks for the idea anyhow.
      Same goes for the CAS vs lazySet, although in that case I'm pretty sure it will cripple the producer performance. I'll implement both at some point and post the results. The cost of CAS/volatile set in terms of flushing the write buffers means none of the producer writes will get batched, I doubt it's a good trade off.
      In terms of single offer/poll latency you may be right, but it's a trade off I would be hard pushed to sell. I guess it depends on your use case.
      Sorry I took so long to reply, I didn't get an e-mail notification about this comment for some reason.

      Delete
  3. Andrew Bissell14 Oct 2013, 19:20:00

    Thanks for the link to Cliff Click's rant about final fields. I find it very unfortunate that the JVM is held back from optimizing finals because of frameworks like Spring which have such a high bloat/benefit ratio.

    ReplyDelete
  4. What do you attribute this improved performance to compared to the queues based on Martin's talk? Is it the unifying of control and data and thus less cache traffic? The disadvantage I see with the nulling an object out approach is there isn't an easy way to process entries in parallel like the Disruptor where every consumer can be fairly independent. The whole time slipping approach is also more complicated and will require tuning to get the numbers right.

    Yet another SPSC implementation that might prove valuable to this series: http://staff.ustc.edu.cn/~bhua/publications/IJPP_draft.pdf - this one claims performance improvement over the FastForward queue and the MCRingBuffer implementation, especially in real life scenarios.

    I wonder how useful it is to benchmark with passing integers btw. Queue performance could change a lot if the structures being passed were meatier and could not fit on the L1 so nicely. Also if the consumer thread actually did some useful processing with the data instead of just discarding it. The paper I linked to argues for more realistic benchmarks like a network processing application. It also suggests measuring with more than one producer/consumer pair running concurrently to see how having multiple queues would scale.

    ReplyDelete
    Replies
    1. 1. I've worked through the chain of changes through 3 posts by now, so the comment is a bit short to cover anything in detail. Main changes: Inlining data structures, properly padding data structures, padding the array, padding the elements in the array, using unsafe to access the array. FF is also a different algorithm. The net result is avoiding any accidental false sharing or false sharing caused by a near empty queue.
      Processing entries in parallel is indeed not part of the Queue interface which I stuck with here. The same techniques can be applied to a more Disruptor like structure. Time slipping is a term used in the article you provided and is not mentioned in the post and is another name for a backoff strategy which is not used here at all beyond yielding on empty/full queue.

      2. The article you offer is interesting, I'll have to take some time to work through it and test the implementation etc. Thanks.

      3. The benchmark is .... a benchmark. It's not perfect and it is not 'real-world'. To compare queue implementations in a well rounded manner I would need a host of tests covering latency/throughput behaviours when near empty/full or constant slack and so on, while contending with different load on cache system etc, etc. Adding more real life computation at the producer/consumer end is subject to use case, and adding large/small objects or allocations on either end are not a universal 'real-life' application... for the purpose of examining the concurrency aspects of the queues having less logic/behaviour to worry about has proved useful to my way of thinking.

      The benchmark has proved a good enough form of comparison for the sake of discussion, it is not a complete assessment of all the different ways in which the queues will help/ruin your application. So apply with caution :-)
      Thanks for reading, and for the interesting article.

      Delete
    2. OK. Read the article, offers some interesting improvements. The algorithm in use is very similar to the one used in SPSC with an improvement on guarding near empty/full conditions. I'll have to give it a go :-).
      I agree with the general call for more real world benchmarks, in particular I think a good variety needs to be considered. A community effort towards that would be awesome.

      Delete
    3. 1. Nitsan, I understand the improvements that you covered in your other articles. I mean't to ask your opinion on the performance improvement from the algorithm change alone. Did Martin's algorithm have any false sharing? AFAIK the producer always wrote to the array and the control variable, whereas the consumer always read from the array and wrote to it's own control variable. Did an almost empty queue cause false sharing some how? It seemed to me that the false sharing was caused by the nulling out of queue entries by the consumer, a problem not suffered by Martin's original queue.

      3. About the benchmark being realistic, I see your point that there are way too many variations to cover them all. I definitely appreciate the work done in this series, to explore theory and to provide real code that can be tested and judged based on particular use cases. What variety of applications are you thinking of? Maybe some of us can contribute applications where the queues are subbed out for performance tests.

      Delete
    4. I look forward to the new interface. Thanks for this series. Keep them coming!

      Delete
    5. 1. Nulling out the entries from the consumer is part of the original implementation. The implementation presented in Martin examples (from his presentation) also suffered from false sharing as the padded counters where only padded on one side. As stressed out previously these are samples which I'm uncomfortable faulting Martin for given he has all these issues sorted in his course material/stuff he actually uses. Note that nulling out the entries (by someone) is necessary to avoid a memory leak.

      3. If people could contribute applications and work loads that would be great. I'm planning to diverge from the Queue interface in the near future as I think I could do more with an interface with separates the producers/consumers explicitly. Fitting that into the applications would require some effort. As I get closer to releasing the queues I will reach out for help. Thanks :-).

      Delete
  5. First of all i want to thank you so much for this awsome code :-)

    I am reading the SPSCQueue8.java and FFBufferOrdered3.java. And i have few questions.

    1) The sparse thingy just flew over my head. Can you explain a little bit more pleaseor point me into where it is covered.

    2) It seems like the queue only stores the pointers in both cases. Can you store normal POJOs in the queue ? is it even worth it ?

    3) Is there a low level CPU code (even via native code) that would *specifically* tell the cpu/compiler that i am accessing a specific set of elements from off heap buffer (in a sequential manner) and that i need few elements to be prefetched in L1 or L2 cache ?

    ReplyDelete
    Replies
    1. No worries :)
      1) The sparse data means instead of using every element in the array we use 1 in every 2/4/8 thus the elements are sparse. The downsize ifs that the array now needs to be 2/4/8 times bigger, the benefit is that when the queue is near empty/full the producer and consumer are far more likely to suffer false sharing as they exchange the last few elements.
      2) How are you planning to store POJOs in the queue? the queues are like any other Java object collection out there, they store references to the objects... I'm not sure I understand your question.
      3) I don't think there are instructions you can use to directly control the pre-fetcher beyond basic configuration (on/off/maybe size). It works based on your memory access patterns (sequential/even strides are easy to predict).
      I'm worried from your question that you somehow think the above article refers to any use of off-heap memory... the one thing you cannot store in off-heap memory is references to objects on the heap, the above queues are on heap and for storing/passing references to on-heap objects.

      Delete
    2. Indeed my main interest is in off heap transfer of normal java objects to a native library as a collection of items. My queues are used to notify the java layer that an object is ready for consumption. However, I understand its not the goal or in scope with these collections.

      Delete
    3. If your queue is off-heap then the struct you use can be inlined into the queue buffer i.e the elements are the structs rather than the references to the structs (like any array of structs in C). This will naturally introduce the sparseness as only so many of these structs will fit in a cache line. You can align each element to the cache line if their size justifies it.

      Delete
  6. 1) The struct are passed around using generated code so i have a am free to introduce cache line awarness which i am not doing now. thanks for the hint :-)

    2) I am looking at _mm_prefetch right now and it might do the prefetching i was looking for ... i will keep you posted when i get around this part of the code. my trget platform is always x86_64. I would apreciate anyone input if anyone reading this blog has played with it.

    ReplyDelete