Monday, 19 October 2015

Expanding The Queue interface: Relaxed Queue Access


Continuing from previous post on the expansion of the Queue interface to support new ways of interacting with queues I have gone ahead and implemented relaxedOffer/Poll/Peek for the JCTools queues. This was pretty easy as the original algorithms all required a bit of mangling to support the strong semantic and relaxing it made life easier. Implementing batch/continuous interactions was a bit more involved, but the results were interestingly rewarding. I will save discussing the drain/fill results for a follow up post.
The new interface is here, feedback is welcome.
TL;DR:

  • Throughput is improved in all cases (~5-15% increase)
  • Cost of failure (when offer/poll fail to return/add a value) is dramatically reduced
  • Latency for bursts is improved (as much as 40% reduction)

Relaxed Access API

This ended up a very small API change, most of the difficulty here being choice of name. I went with the 'relaxed' prefix as suggested by J.P.Bempel (as opposed to the 'weak' prefix considered earlier):

Relaxed vs. Non-Relaxed Example

The 'relaxation' of the queue full/empty reporting in these methods allows us to remove an extra load of the 'opposite' index and avoiding waiting around for delayed visibility. Here's how MPSCArrayQueue::poll was changed to become 'relaxed':

So now we can drop lines 11-15, it doesn't look like much. We wouldn't really expect this gap in offer (see offer for details) to happen very often, so the overhead can typically amount to an extra load of the producerIndex and a predictable branch.
Also, since poll on an MPSC is actually cheaper than the offer anyhow, aren't we optimizing a non-issue? Could we not be actually making throughput worse by making the consumer more aggressive?

Throughput Impact Of Relaxation

To test the impact of this change I constructed a JMH benchmark that measures throughput when both consumer and producer spin to offer/poll as fast as they can (see the code here). I ran the benchmark on a Haswell machine(i7-4770 CPU@3.40GHz/Ubuntu/OracleJVM8u60). The benchmarks were pinned to run cross core (no 2 threads share a physical core, i.e. no hyperthreading). Benchmarks were run with 5 forks, 10 warmup iterations (1 second), 10 measured iterations (1 second). The score reports operations per microsecond, or millions of operation per second if you prefer. The throughput is the number of pollsMade, but the other figures are interesting as well.
These results are with a single producer and consumer:
 Type |     1P1C     | Normal± Err  | Relax ± Err  |
 Spsc | offersFailed |   0.0 ±  0.1 |   0.1 ±  0.1 |
 Spsc | offersMade   | 439.1 ±  4.1 | 432.1 ± 10.0 |
 Spsc | pollsFailed  |   0.2 ±  0.5 |   0.1 ±  0.2 |
 Spsc | pollsMade    | 438.9 ±  4.1 | 431.9 ± 10.0 |
 Mpsc | offersFailed |   0.0 ±  0.0 |   0.0 ±  0.0 |
 Mpsc | offersMade   |  22.7 ±  0.6 |  31.9 ±  0.8 |
 Mpsc | pollsFailed  |   0.1 ±  0.1 |   6.6 ±  0.9 |
 Mpsc | pollsMade    |  22.8 ±  0.6 |  31.9 ±  0.8 |
 Spmc | offersFailed |   0.2 ±  0.2 |  81.0 ±  5.4 |
 Spmc | offersMade   |  23.5 ±  0.6 |  26.8 ±  0.6 |
 Spmc | pollsFailed  |   0.1 ±  0.1 |   0.2 ±  0.3 |
 Spmc | pollsMade    |  23.2 ±  0.6 |  26.6 ±  0.6 |
 Mpmc | offersFailed |   0.0 ±  0.0 |   0.0 ±  0.1 |
 Mpmc | offersMade   |  71.4 ±  5.1 |  71.7 ±  4.8 |
 Mpmc | pollsFailed  |   0.0 ±  0.1 |   0.3 ±  0.9 |
 Mpmc | pollsMade    |  71.4 ±  5.1 |  71.7 ±  4.8 |

  • The SPSC case serves as a baseline, and I expected results to be the same. There's a 2% difference which is close to the error reported. Maybe the inlining depth impacted the performance slightly, or perhaps the benchmark is not as stable as I'd like.
  • The MPSC case shows 50% improvement. This is pretty awesome. We also notice the consumer now fails more. This is indicative to the reduced cost of failure to find elements in the queue. The cache miss on the producerIndex was slowing down both the consumer and the producer. This phenomena has been observed in other post, but my best attempt at an explanation is here. The bottom line being that the mutator suffers from losing the exclusive state of the cache line in his own cache, while the reader gets hit with a read miss (going to LLC as the threads are on separate cores).
  • The SPMC case shows 15% improvement on throughput. The cost of failing to offer has gone down dramatically.
  • The MPMC case shows no change, but does show great throughput. This could be down to the very different algorithm in play or just because the MPMC queue is well balanced and thus behaves very well in a balanced use case as above. This may lead you to believe MPMC is a better choice than MPSC, but as the latency results and contended throughput results show this is an anomaly of the benchmark load. Because MPSC presents a much cheaper poll than offer the queue is always empty, making the contention on the queue buffer worse than it would be for MPMC where offer and poll are of similar costs.
Using the command line controls over thread groups sizes (-tg) I ran a few more cases. Did I mention I love JMH?
Here's MPMC and MPSC with 3 producers and 1 consumer:
 Type |     3P1C     | Normal± Err  | Relax ± Err  |
 Mpsc | offersFailed |   0.0 ±  0.0 |   0.1 ±  0.6 | 
 Mpsc | offersMade   |  12.7 ±  0.1 |  13.4 ±  0.1 | 
 Mpsc | pollsFailed  |   4.1 ±  0.5 | 115.3 ±  4.9 | 
 Mpsc | pollsMade    |  12.7 ±  0.1 |  13.4 ±  0.1 | 
 Mpmc | offersFailed |   0.0 ±  0.0 |   0.0 ±  0.0 | 
 Mpmc | offersMade   |  11.1 ±  0.0 |  11.7 ±  0.1 | 
 Mpmc | pollsFailed  |   0.3 ±  0.1 |  60.4 ± 10.0 | 
 Mpmc | pollsMade    |  11.1 ±  0.0 |  11.7 ±  0.1 |
  • MPSC case shows slight improvement to throughput of roughly 5%, but the improvement to cost of failure is very visible here.
  • MPMC case improves by 5% and the improved cost of failure is similarly visible.
Here's MPMC and SPMC with 1 producers and 3 consumer:
 Type |     1P3C     | Normal± Err  | Relax ± Err  |
 Spmc | offersFailed |   1.5 ±  0.2 | 125.9 ±  4.6 |
 Spmc | offersMade   |  13.4 ±  0.1 |  14.0 ±  0.0 |
 Spmc | pollsFailed  |   0.0 ±  0.0 |   0.0 ±  0.0 |
 Spmc | pollsMade    |  13.1 ±  0.1 |  13.7 ±  0.0 |
 Mpmc | offersFailed |  23.9 ±  0.8 | 124.1 ±  3.2 |
 Mpmc | offersMade   |  11.0 ±  0.0 |  12.1 ±  0.0 |
 Mpmc | pollsFailed  |   0.0 ±  0.2 |   1.0 ±  1.6 |
 Mpmc | pollsMade    |  10.7 ±  0.0 |  11.9 ±  0.0 |
  • SPMC case shows  5% improvement and great improvement to failure performance.
  • MPMC case improves by 11% and the improved cost of failure is similarly visible.
Here's MPMC only, with 2 producers and 2 consumers:
 Type |    2P2C      | Normal± Err  | Relax ± Err  |
 Mpmc | offersFailed |   1.4 ±  0.8 |   0.7 ±  0.6 |
 Mpmc | offersMade   |  14.8 ±  0.8 |  16.0 ±  0.6 |
 Mpmc | pollsFailed  |   0.0 ±  0.0 |   0.4 ±  0.5 |
 Mpmc | pollsMade    |  14.7 ±  0.8 |  15.9 ±  0.6 |
  • The improvement to throughput is a nice 8%, but the error is not that far from it so I'm not too confident without lots more runs. This is a similarly symmetric load to the 1P1C and the balance is showing through.
I could go on to higher core counts etc, but I'm actually feeling the win here is pretty conclusive:
  • Throughput is generally improved
  • Failure to make progress is significantly cheaper
Why is this failure mode important? This is particularly important when you have a consumer/producer which is able to make progress in the face of such a failure. A common example of this pattern is a Selector thread which also has an incoming queue of commands to execute, when there's nothing in the queue we would like to go back to selecting on the selector.


Latency impact of Relaxation

Is this going to help us in the case of burst handling latency? There's a benchmark for that too! Or rather there's an estimate of burst processing cost. Code is here. The general idea being to send a burst of messages from the producer, then wait for completion notice from the consumer.
Here goes:
 Size | Type | Normal±Err | Relax± Err |
    1 | Spsc |  126 ±   2 |  126 ±   1 |
    1 | Mpsc |  144 ±   2 |  140 ±   1 |
    1 | Spmc |  187 ±   2 |  189 ±   2 |
    1 | Mpmc |  190 ±   8 |  203 ±   2 |
   10 | Spsc |  193 ±   2 |  192 ±   3 |
   10 | Mpsc |  611 ±  22 |  429 ±   7 |
   10 | Spmc |  309 ±   3 |  307 ±   1 |
   10 | Mpmc |  807 ±  19 |  677 ±  15 |
  100 | Spsc |  530 ±   7 |  528 ±   4 |
  100 | Mpsc | 5243 ± 158 | 3023 ± 107 |
  100 | Spmc | 1240 ±   7 | 1250 ±   6 |
  100 | Mpmc | 4684 ± 382 | 3665 ± 111 |

  • Same as before, SPSC serves as a sanity check/baseline. No change is good.
  • SPMC/MPSC results are pretty similar when sending a single message. MPMC seems slightly regressed.
  • When sending 10/100 messages in a burst we see significant benefits to using the relaxed methods, in particular for the MPSC and MPMC cases.


Summary

Relaxing is good for you, deep down you always knew it. Turns out it ain't half bad for your queues either :-). The new API will be part of the next version of JCTools, coming out as soon as I can work through my Maven mojo phobia.

Thanks J.P. Bempel and Phil Aston for their kind review, remaining errors are a result of me refusing to listen to them persnickety bastards.