Monday, 11 November 2013

SPSC IV - A look at BQueue

I thought I was done with SPSC queues, but it turns out they were not done with me...
A kind reader, Rajiv Kurian, referred me to this interesting paper on an alternative SPSC implementation called BQueue. The paper compares the BQueue to a naive Lamport queue, the FastFlow queue and the MCRingBuffer queue and claims great performance improvements both in synthetic benchmarks and when utilised in a more real world application. I had a read, ported to Java and compared with my growing collection, here are my notes for your shared joy and bewilderment.

Quick Summary of Paper Intro. Up To Implementation

I enjoyed reading the paper, it's not a hard read if you have the time. If not, here's my brief notes (but read the original, really):
  • Some chit chat goes into why cross core queues are important - agreed.
  • Paper claims many SPSC queues perform very differently in a test-bed vs real world, in particular as real world applications tend to introduce cache traffic beyond the trivial experienced in naive benchmarking - I agree with the sentiment, but I think the "Real World" is a challenging thing to bring into your lab in any case. I would argue that a larger set of synthetic experiments would serve us all better in reviewing queue performance. I also admit my benchmarking thus far has not been anywhere rich enough to cover all I want it to cover. The application presented by the authors is a single sample of the rich ocean of applications out there, who's to say they have found the one shoe which fits all? Having said that, it still strikes me as a good way to benchmark and a valid sample to add to the general "your mileage may vary, measure with your own application" rule.
  • Existing queue implementations require configuration or deadlock prevention mechanisms and are therefore harder to use in a real application -While this argument is partially true for the solutions discussed thus far on this blog, I fear for best results some tuning has to be done in any case. In particular the solution presented by the paper still requires some configuration.
  • Examines Lamport and points out cache line thrashing is caused by cross checking the head/tail variables -  Code for the Lamport variation can be found here.  The variant put forward by Martin Thompson and previously discussed here (with code here) solves this issue by employing producer/consumer local caches of these values. This enhancement and a more involved discussion of cache coherency traffic involved is further discussed here.
  • Examines FastForward queue, which is superior to Lamport but - "cache thrashing still occurs if two buffer slots indexed by head and tail are located in the same cache line". To avoid this issue FF employs Temporal Slipping which is another way of saying a sleep in the consumer if the head/tail are too close. This technique leads to cache thrashing again, and the definition of too close is the configuration mentioned and resented above. - Right, I was not aware of temporal slipping, but I love the term! Next time I'm late anywhere I'll be sure to claim a temporal slippage. More seriously: this strikes me as a streaming biased solution, intentionally introducing latency as the queue becomes near empty/full. From the way it is presented in the article it is also unclear when this method is to be used, as it is obviously not desirable to inflict it on every call.
  • Examine MCRingBuffer, which is a mix of 2 techniques: head/tail caching (like the one employed by Mr. T) and producer batching. The discussion focuses on the issues with producer batching (producer waits for a local buffer to fill before actually writing anything to the queue), in particular the risk of deadlock between 2 threads communicating via such queues - I would have liked the discussion to explore the 2 optimisations separately. The head/tail cache is a very valid optimisation, the producer batching is very problematic due to the deadlock issue described, but also as it introduces arbitrary latency into the pipeline. Some variations on the queue interface support batching by allowing the publisher to 'flush' their write to the queue, which is an interesting feature in this context.
  • Some discussion of deadlock prevention techniques follows with the conclusion that they don't really work.

Looking at the Implementation

So I've been playing with SPSC queues for a while and recently wrote a post covering both the FastFlow queue and the use of sparse data to eliminate the near empty thrashing issue described above. I was curious to see a new algorithm so ported the algorithm from the paper to Java. Note the scary Unsafe use is in order to introduce the required memory barriers (not NOOPs!). Here is what the new algo brings to the table when written on top of my already existing SPSC queues code (code here):

Changes I made to original (beyond porting to Java):
  • Renamed tail/head to be in line with my previous implementations.
  • Renamed the batch constants according to their use.
  • Left the modulo on capacity to the offset calculation and let the counters grow as per my previous implementations (helped me eliminate a bug in the original code).
In a way it's a mix of the FastFlow(thread local use of counters, null check the buffer to verify offer/poll availability) and the MC/Mr.T solutions(use a cached look ahead index value we know to be available a write/read up to that value with no further checks), which I like (yay cross pollination/mongrels... if you are into any variation on racial purity try having children with your cousins and let me know how it went). The big difference lies in the way near empty/full condition are handled. Here's a few points on the algorithm:
  • The cached batchHead/Tail are used to eliminate some of the offer/poll need to null compare we have in FFBuffer. Instead of reading the tail/head every time we probe ahead on the queue. This counts on the null/not-null regions of the queue being continuous:
    • For the producer, if the entry N steps ahead of TAIL (as long as N is less than capacity) is null we can assume all entries up to that entry are also null (lines 4-11). 
    • Similarly we probe ahead as we approach the near empty state, but here we are looking for a not null entry. Given that the not null region is continuous, we probe into the queue ahead of the HEAD, if the entry is null then we have overshot the TAIL. If it is not null we can safely read up to that point with no further check.
  • The BQueue offers us a variation on the above mentioned temporal slip that involves the consumer doing a binary search for a successful probe until it hits the next batch size or declares the queue empty (lines 40 - 49). This is triggered every time the consumer needs to re-calculate the batchHead value having exhausted the known available slots.
  • Every poll probe failure is followed by a spinning wait (line 55-59).
  • Using the buffer elements as the indicator of availability allows the BQueue to avoid reverting to the cache thrashing caused by comparing and sharing the head/tail counters. 
  • The slowdown induced by the algorithm and the spinning wait allows the producer to build up further slack in the queue.  
The temporal slip alternative is the main contribution celebrated in the paper, so it is worth comparing the code snippets offered for all variations.

A Word On Benchmarking

I ran the same benchmark I've been using all along to evaluate SPSC queues. It's synthetic and imperfect. The numbers you get in this benchmark are not the numbers you get in the real world. Your mileage is almost certain to vary. I've been getting some worried responses/twits/comments about this... so why do I use it? 
  • If you have been reading this series of posts you will see that the benchmark has been instrumental in identifying and measuring the differences between implementations, it's been good enough to provide evidence of progress or regression.
  • As a test harness it's been easy to use and examine via other tools. Printing out assembly, analysing hardware counter data, tweaking to a particular format: all easily done with a simple test harness. 
It's imperfect, but practical. I'm increasingly uncomfortable with it as a benchmark because I don't feel the harness scales well to multi-producer examination and I'm in the process of implementing the benchmarks using JMH. As that work is not complete I use what I have.
In the interest of completeness/open for review/honesty/general info, here is how I run benchmarks at the moment:
  1. I'm using my laptop, a Lenovo y510p. It's an i7-4700MQ.
  2. When running benchmarks I pin all running processes to core 0, then run my benchmarks on cores 1-7 (I avoid 1 if I can). It's not as good is isolcpus, but good enough for a quiet run IMHO.
  3. For this current batch of runs I also disabled the turbo-boost (set the frequency to 2.4GHz) to eliminate related variance. In the past I've left it on (careless, but it was not an issue at the time), summer is coming to South Africa and my machine overheats with it on.
  4. I'm running Ubuntu 13.10 64 bit, and using JDK7u45 also 64 bit.
  5. All the benchmarks were run with "-server -XX:+UseCondCardMark -XX:CompileThreshold=100000". I use these parameters for consistency with previous testing. In particular the UseCondCardMark is important in this context. See previous post for more in depth comparison of flags impact on performance.
  6. For this set of results I only examined the cross core behaviour, so the runs were pinned to core 3,7
  7. I do 30 runs of each benchmarks to get an idea of run to run variance.
I include the above as an open invitation for correction and to enable others to reproduce the results should they wish. If anyone has the time and inclination to run the same benchmarks on their own setup and wants to share the data I'd be very interested. If anyone has a suite of queue benchmarks they want to share, that would be great too.
As for the validity of the results as an indication of real world performance... the only indication of real world performance is performance in the real world. Having said that, some problems can be classified as performance bugs and I would argue that false-sharing, and bad choice of instructions falls under that category. All other things being equal I'd expect the same algorithm to perform better when implemented such that it's performance is optimal.


Taking BQueue for a Spin  

And how does it perform? It depends... on  the configuration(what? but they said?! what?). There are 3 parameters to tune in this algorithm:
  1. TICKS - how long to spin for when poll probing fails.
  2. POLL_MAX_BATCH_SIZE - the limit, and initial value for the poll probe.
  3. OFFER_BATCH_SIZE - how far to probe for offer, also note that OFFER_BATCH_SIZE elements will remain unused of the queue capacity (i.e the queue can only fit [capacity - OFFER_BATCH_SIZE] elements before offer fails).
I played around with the parameters to find a good spot, here's some variations (see the legend for parameters, X is run number, Y is ops/sec):
  1. There's no point in setting the batch sizes lower than 32 (that's 2 cache lines in compressed OOPS references) as we are trying to avoid contention. Also, it'll just end up being a confusing branch for the CPU to hit on and the whole point of batching is to get some amortized cost out of it. Even with batch sizes set as low as 32 for both the queue performs well (median is around 261M) but with significant variance.
  2. Upper left chart shows the results as the offer batch is extended (poll batch remains at 32). The larger offer batch offers an improvement. Note that extending the offer batch only gives opportunity for the producer to overtake the consumer as the consumer hits the poll probing logic every 32 steps. A large slack in the queue allows the best throughput (assuming consumer is mostly catching up). The variance also decreases as the offer batch is increased.
  3. Upper right chart shows the results for increasing the poll batch (offer batch remains at 32). As we can see this actually results in worse variance.
  4. Increasing both the offer and the batch size in step (lower left chart) ends up with better overall results, but still quite significant variance
  5. Keeping the offer/poll batch at 8k I varied the spin period which again results in a different profile, but no cure to variance.
For context here is the BQ result (I picked the relatively stable 80,32,8k run) compared with the FFBuffer port and my inlined counters variation on Thompson/Lamport with and without sparse data(Y8=inlined counters,the square brackets are parameters, for FF/Y8 it is the sparse shift so 2 means use every 4th reference):
[NOTE: previous benchmarks for same queues were run with turbo boost on, leading to better but less stable results, please keep in mind when considering previous posts]
As you can see the BQueue certainly kicks ass when compared to the non-sparse data variations, but is less impressive when sparse data is used. Note that BQueue manages to achieve this result with far less memory as it still packs the references densely (note that some memory still goes wasted as mentioned above, but not as much as in the sparse data variations).What I read into the above results:
  1. This algorithm tackles the nearly empty/full queue in an appealing manner. Probing the queue to discover availability is also a means of touching the queue ahead and bringing some required data into the cache.
  2. The main reason for the speed up is the slowing down of the consumer when approaching empty. This serves as a neat solution for a queue that is under constant pressure.
  3. The spinning wait between probes presents a potentially unacceptable behaviour. Consider for instance a consumer who is sampling this queue for data, but needs to get on with other stuff should it find it empty. Alternatively consider a low latency application with bursty traffic such that queues are nearly empty most of the time. 
  4. I'll be posting a further post on latency benchmarking the queues, but currently the results I see (across cores, implementing ping pong with in/out queue) suggest the FF queue offers the best latency(200ns RTT), followed by Y8(300ns) and the BQueue coming in last(750ns). I expect the results to be worst with bursty traffic preventing the batch history from correctly predicting a poll batch size.


Summary

This is an interesting queue/paper to me, importantly because it highlights 2 very valid issues:
  1. Near empty/full queue contention is a significant concern in the design of queues and solving it can bring large performance gains to throughput.
  2. Real application benefits may well differ from synthetic benchmark benefits. To support better decision making a wider variety of tests and contexts needs to be looked at.
I think the end solution is appropriate for applications which are willing to accept the latency penalty incurred by the consumer when hitting a nearly empty queue.  The near full queue guard implemented for this queue can benefit other queue implementations and has no downside that I can see beyond a moderate amount of wasted space. 
Thanks again to Rajiv Kurian for the pointer to this queue, and Norman M. for his help in reviewing this post.




20 comments:

  1. Good stuff Nitsan. Waiting for your entry on other interfaces for inter thread communication. Also curious, how many transactions do you get on the disruptor with your set up where you do not have to create new objects or set old ones to null?

    ReplyDelete
    Replies
    1. I have a benchmark for the Disruptor being used as an SPSC queue here: https://github.com/nitsanw/examples/blob/revisited/src/disruptor/DisruptorSPSCPerfTest.java
      The test is as similar to QueuePerfTest as I could make it, but I urge anyone to review it before reading too much into a comparison. With that benchmark and a locally built disruptor 3.2 jar I get 100-120M ops/sec across cores. Please note that this is a naive use of the Disruptor and ignores many other features it brings to the table.

      Delete
    2. Nitsan .. this was a delightful novel piece of work :-)

      keep it up !

      Delete
    3. Thanks Nitsan. I appreciate the caveats regarding the benchmark in the article. One thing I'd like to point out with the Disruptor and tiny events is that since it uses a couple of longs for synchronization, if the event itself is just an int the synchronization variables causes as much or more cache traffic than the actual payload. With a much larger payload the synchronization costs for the disruptor should be relatively lower. But your conclusions in general make sense, if we can prevent false sharing in queues with no control variables at all, they should be faster than the disruptor for SPSC workloads.

      Delete
    4. I don't think your analysis of the cache traffic is correct. The counters are cache local and only experience a miss(read/write) when the last read value is exhausted. Given that cache coherency happens on the cache line level it would make no difference if the counters were ints (or for that matter shorts), a miss is a line miss. The issue with the counter approach is when the queue is near empty leading to traffic around 3 cache lines instead of just 1 in the case of FF.
      Also note that the Disruptor when used naively as a queue adds a further pointer chase to the queue. The event objects themselves may be subject to false sharing leasing to further issues.

      Delete
    5. This comment has been removed by the author.

      Delete
    6. This comment has been removed by the author.

      Delete
    7. I didn't allege that the fact that the counter is a long and not an int or short is the problem. When you write an event in the disruptor you are writing an int for the event and writing a long for the control. In the FF queue you are just writing an int. The issue I was trying to highlight is that each event data (set on one core) needs to travel to the other core to be processed. As does the counter. The fact that the event size is less than or equal to the counter size means that similar amounts of control information and real event information are having to travel to another core. But if each event were say a large image, then the event information : control information ratio would be smaller, diminishing this particular advantage that the FF queue has - the advantage that less data needs to travel across cores.

      When you say "when the queue is empty", which queue are you talking about? The disruptor? If so why would it have traffic around 3 cache lines when empty but not otherwise? Is it because the cached counter values are no longer good enough to make a decision as to whether to overwrite a particular entry?

      Also why would the event objects be subject to false sharing for the disruptor and not for the other queues? My C++ implementation actually in-lines the array with all the control information so that the event objects are not subject to false sharing.

      Delete
    8. Edit: event information : control information ratio would be bigger.

      Delete
    9. You are comparing the counter size to the event size, I'm just saying the counter size is not significant.
      For a given event (of any size) the disruptor, and the queues discussed above, act as reference queues delivering the reference to the event rather than the event itself. The Disruptor is in fact passing a reference to the event wrapper object which will hold another reference to the actual event (when used naively as a queue) hence the extra pointer chase.
      For the Disruptor, consumers don't cache the producer index like the SPSCQueue implementation, but rather sample it once per batch read of events from the ring buffer (see the BatchEventProcessor). The SingleProducerSequencer does caching in a similar fashion to the SPSCQueue. This means that the control/synchronization data is only exchanged when the locally cached view is exhausted. So once per many events in theory.
      When the queue/disruptor is empty or near empty the consumer view is rapidly exhausted, thus reducing the effectiveness of the batching. To exchange a single event we trigger the following:
      1. Read miss of the tail value
      2. Read miss of the event from the buffer
      3. Potential write miss when nulling out the event in the queue if the producer wrote another event
      The above is not about the size of the event (though inlined events will spare you number 3 when inlined to the buffer) it is due to the queue being near empty. If the queue is near empty all the time because the consumer is faster than the producer for instance than this will become significant.
      False sharing for the disruptor is an extra risk because of the extra indirection of the event wrappers which can experience false sharing between them. This is a worst case scenario when the event wrappers are small enough and laid out densely enough to trigger false sharing when an event is read from one and nulled out/written in the other. This worst case scenario is again about the case where the queue is near empty or near full.

      Delete
    10. Aah this makes sense. I still don't get the nulling out of the event part in your 3 cache line exchange explanation and in the false sharing explanation. I thought in the disruptor framework the consumer never nulls out the event in the queue, the producer merely re-uses it when it's safe to do so.

      Delete
    11. When people use the Disruptor as a generic queue replacement (which is not a good use for it) they will have the ring buffer pre-filled with an event holder. The producer will claim a slot, pop an event into it, the consumer will read the slot get the event, and null it out. If you do not null out the event for generic queues you get a memory leak. So the consumer never nulls out the slot in the ring buffer, but does null out the field in the event wrapper. The event wrappers are allocated all at once, so may well be laid out back to back in memory. They have no padding, so read write operations on them can cause false sharing.

      Delete
    12. Thanks Nitsan. Sorry to be bothersome, but why would there be a memory leak exactly? Could you please point me to an example (maybe on the disruptor src examples) where the consumer actually nulls out the field in the event wrapper.

      Delete
    13. Not at all bothersome, the more you ask the more I have to think.
      In a GC runtime any reference which is not nulled will stop the referred object from being collected. If the polled element from a reference queue is not nulled there is a memory leak. The number of references leaked is bounded by the size of the queue, so in some cases you may consider it acceptable. In a generic reference queue where the size of elements referenced in the queue is unknown it is generally not acceptable.
      The disruptor is not a generic queue, and using it as such is naive. But if you were to use it as such you would have to null out the reference holder. This is not a criticism of the Disruptor, but of inappropriate use of it.

      Delete
    14. In the example that you posted at https://github.com/nitsanw/examples/blob/revisited/src/disruptor/DisruptorSPSCPerfTest.java, where are you nulling events out?
      When you say a generic reference queue, are you talking about a queue where the references held are to elements of different sizes? In your example all the elements of the queue point to ValueEvent references right?
      Is your point that if the references held in the queue point to objects of different kinds we cannot re-use them readily and hence must null them out and create ones. Also when you say the disruptor is not a generic queue, do you mean if the disruptor was used with a poll/offer interface we would need to null out entries? If you use it in a typical manner, where your events are homogenous, you fetch next sequence, modify the object and publish the sequence then would nulling out by the consumer still be a concern?

      Delete
    15. 1. In my example I use it as a queue of ints, there's no nulling anywhere. ValueEvent is a wrapper for a mutable int. If it were a reference queue it would wrap an Object reference.
      2. The size being unbounded and unplanned for is part of the issue. It's about expected behaviour. If you take Runnables off a queue and run them, do you expect their state to hang around after they are done?
      3. I can't comment on the typical use of the Disruptor, but I know some people try and use it as a replacement for a queue. It's not a bad plan as it beats the JDK queues, but it's not the interface or design it was built for IMHO.
      4. The Disruptor doesn't inherently require you to null out anything, only when your events wrap references do you need to start worrying about nulling those references out.

      Delete
  2. Cool stuff. BTW have you looked at SPSC implementations in Boost and 1024cores.net?

    ReplyDelete
    Replies
    1. Glad you like it :-)
      1024.net is an excellent blog, my recollection was that the queue discussed there was an unbounded SPSC which I've not really explored in the post series. Can you point me at the page/code you have in mind?
      Boost SPSC(http://www.boost.org/doc/libs/1_53_0/boost/lockfree/spsc_queue.hpp) is similar to the padded Lamport implementation covered early on this series, but is unbounded (and counters are inlined). As such I don't think it adds anything significant to the discussion at this point. Am I missing something?
      Thanks,
      Nitsan

      Delete
    2. I agree boost SPSC is not optimal. there's a 1024cores blog regarding FF:

      http://www.1024cores.net/home/lock-free-algorithms/queues/fastflow

      Delete