The new interface is here; feedback is welcome.
TL;DR:
- Throughput is generally improved (roughly 5-40%, depending on queue type and workload)
- Cost of failure (when offer/poll fail to return/add a value) is dramatically reduced
- Latency for bursts is improved (as much as 40% reduction)
Relaxed Access API
This ended up being a very small API change; most of the difficulty was in the choice of name. I went with the 'relaxed' prefix, as suggested by J.P. Bempel (as opposed to the 'weak' prefix considered earlier):
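For reference, here's a sketch of the new methods as they land on the JCTools MessagePassingQueue interface (an excerpt; the contracts are paraphrased, see the JCTools javadoc for the full text):

```java
// Excerpt from org.jctools.queues.MessagePassingQueue (paraphrased contracts).
public interface MessagePassingQueue<T> {
    // Like Queue::offer, but a false return only means "no free slot was
    // visible to this producer right now", not a linearizable "queue is full".
    boolean relaxedOffer(T e);

    // Like Queue::poll, but a null return only means "no element was visible
    // to this consumer right now", not a linearizable "queue is empty".
    T relaxedPoll();

    // Like Queue::peek, with the same relaxed emptiness semantics.
    T relaxedPeek();
}
```

Callers that spin, retry, or have their own backoff logic can use these as drop-in replacements for offer/poll/peek.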
Relaxed vs. Non-Relaxed Example
The 'relaxation' of the queue full/empty reporting in these methods allows us to remove an extra load of the 'opposite' index and avoid waiting around for delayed visibility. Here's how MPSCArrayQueue::poll was changed to become 'relaxed':
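Below is a simplified sketch of the before/after (the real JCTools code uses Unsafe-based array access and padded index fields; the lv/lp/sp/so method prefixes stand for load-volatile, load-plain, store-plain and store-ordered):

```java
// Non-relaxed poll (simplified): a null element is ambiguous. The consumer
// must load the producerIndex to tell "queue is empty" apart from "a producer
// has claimed the slot but the element is not visible yet", and in the latter
// case it waits for the element to appear.
public E poll() {
    final long cIndex = lpConsumerIndex();
    final long offset = calcElementOffset(cIndex);
    E e = lvElement(offset);
    if (e == null) {
        if (cIndex != lvProducerIndex()) {
            // an offer is in flight: spin until the element becomes visible
            do {
                e = lvElement(offset);
            } while (e == null);
        } else {
            return null; // the queue really is empty
        }
    }
    spElement(offset, null);
    soConsumerIndex(cIndex + 1);
    return e;
}

// relaxedPoll (simplified): drop the producerIndex load and the wait, and
// report a null element as "nothing visible right now".
public E relaxedPoll() {
    final long cIndex = lpConsumerIndex();
    final long offset = calcElementOffset(cIndex);
    E e = lvElement(offset);
    if (e == null) {
        return null;
    }
    spElement(offset, null);
    soConsumerIndex(cIndex + 1);
    return e;
}
```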
So relaxedPoll drops the producerIndex check and the spin-wait block; it doesn't look like much. We wouldn't really expect this gap in offer (see offer for details) to happen very often, so the overhead of the non-relaxed poll typically amounts to an extra load of the producerIndex and a predictable branch.
Also, since poll on an MPSC is actually cheaper than offer anyhow, aren't we optimizing a non-issue? Could we not actually be making throughput worse by making the consumer more aggressive?
Throughput Impact Of Relaxation
To test the impact of this change I constructed a JMH benchmark that measures throughput when both consumer and producer spin to offer/poll as fast as they can (see the code here). I ran the benchmark on a Haswell machine (i7-4770 CPU @ 3.40GHz / Ubuntu / Oracle JVM 8u60). The benchmarks were pinned to run cross-core (no two threads share a physical core, i.e. no hyperthreading) and were run with 5 forks, 10 warmup iterations (1 second each) and 10 measured iterations (1 second each). The score reports operations per microsecond, or millions of operations per second if you prefer. The throughput is the number of pollsMade, but the other figures are interesting as well.
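For orientation, here's a minimal sketch of the shape of such a benchmark (not the actual JCTools benchmark class; the queue type, capacity, and names are illustrative):

```java
import java.util.Queue;
import java.util.concurrent.TimeUnit;
import org.jctools.queues.MpscArrayQueue;
import org.openjdk.jmh.annotations.*;

// A minimal sketch of a spinning offer/poll throughput benchmark. Producer
// and consumer methods run in the same JMH group on separate threads, and
// failed calls are reported separately via @AuxCounters.
@State(Scope.Group)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class QueueThroughputSketch {
    Queue<Integer> q;

    @Setup
    public void init() {
        q = new MpscArrayQueue<>(128 * 1024); // assumed queue type and capacity
    }

    @AuxCounters
    @State(Scope.Thread)
    public static class OfferCounters {
        public long offersMade, offersFailed;
    }

    @AuxCounters
    @State(Scope.Thread)
    public static class PollCounters {
        public long pollsMade, pollsFailed;
    }

    @Benchmark
    @Group("tpt")
    @GroupThreads(1) // one producer; raise for the 3P1C/2P2C runs
    public void offer(OfferCounters c) {
        if (q.offer(1)) c.offersMade++;
        else c.offersFailed++;
    }

    @Benchmark
    @Group("tpt")
    @GroupThreads(1) // one consumer
    public void poll(PollCounters c) {
        if (q.poll() != null) c.pollsMade++;
        else c.pollsFailed++;
    }
}
```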
These results are with a single producer and consumer:
Type | 1P1C | Normal ± Err | Relax ± Err |
Spsc | offersFailed | 0.0 ± 0.1 | 0.1 ± 0.1 |
Spsc | offersMade | 439.1 ± 4.1 | 432.1 ± 10.0 |
Spsc | pollsFailed | 0.2 ± 0.5 | 0.1 ± 0.2 |
Spsc | pollsMade | 438.9 ± 4.1 | 431.9 ± 10.0 |
Mpsc | offersFailed | 0.0 ± 0.0 | 0.0 ± 0.0 |
Mpsc | offersMade | 22.7 ± 0.6 | 31.9 ± 0.8 |
Mpsc | pollsFailed | 0.1 ± 0.1 | 6.6 ± 0.9 |
Mpsc | pollsMade | 22.8 ± 0.6 | 31.9 ± 0.8 |
Spmc | offersFailed | 0.2 ± 0.2 | 81.0 ± 5.4 |
Spmc | offersMade | 23.5 ± 0.6 | 26.8 ± 0.6 |
Spmc | pollsFailed | 0.1 ± 0.1 | 0.2 ± 0.3 |
Spmc | pollsMade | 23.2 ± 0.6 | 26.6 ± 0.6 |
Mpmc | offersFailed | 0.0 ± 0.0 | 0.0 ± 0.1 |
Mpmc | offersMade | 71.4 ± 5.1 | 71.7 ± 4.8 |
Mpmc | pollsFailed | 0.0 ± 0.1 | 0.3 ± 0.9 |
Mpmc | pollsMade | 71.4 ± 5.1 | 71.7 ± 4.8 |
- The SPSC case serves as a baseline, and I expected results to be the same. There's a 2% difference, which is close to the reported error. Maybe the inlining depth impacted performance slightly, or perhaps the benchmark is not as stable as I'd like.
- The MPSC case shows a ~40% improvement. This is pretty awesome. We also notice the consumer now fails more. This is indicative of the reduced cost of failing to find elements in the queue. The cache miss on the producerIndex was slowing down both the consumer and the producer. This phenomenon has been observed in other posts, but my best attempt at an explanation is here. The bottom line is that the mutator suffers from losing the exclusive state of the cache line in its own cache, while the reader gets hit with a read miss (going to LLC, as the threads are on separate cores).
- The SPMC case shows a 15% improvement in throughput. The cost of failing to offer has gone down dramatically.
- The MPMC case shows no change, but does show great throughput. This could be down to the very different algorithm in play, or just because the MPMC queue is well balanced and thus behaves very well in a balanced use case like the one above. This may lead you to believe MPMC is a better choice than MPSC, but as the latency results and contended throughput results show, this is an anomaly of the benchmark load. Because MPSC presents a much cheaper poll than offer, the queue is almost always empty, making the contention on the queue buffer worse than it would be for MPMC, where offer and poll have similar costs.
Here's MPMC and MPSC with 3 producers and 1 consumer:
Type | 3P1C | Normal ± Err | Relax ± Err |
Mpsc | offersFailed | 0.0 ± 0.0 | 0.1 ± 0.6 |
Mpsc | offersMade | 12.7 ± 0.1 | 13.4 ± 0.1 |
Mpsc | pollsFailed | 4.1 ± 0.5 | 115.3 ± 4.9 |
Mpsc | pollsMade | 12.7 ± 0.1 | 13.4 ± 0.1 |
Mpmc | offersFailed | 0.0 ± 0.0 | 0.0 ± 0.0 |
Mpmc | offersMade | 11.1 ± 0.0 | 11.7 ± 0.1 |
Mpmc | pollsFailed | 0.3 ± 0.1 | 60.4 ± 10.0 |
Mpmc | pollsMade | 11.1 ± 0.0 | 11.7 ± 0.1 |
- The MPSC case shows a slight throughput improvement of roughly 5%, but the reduced cost of failure is very visible here.
- The MPMC case improves by 5%, and the reduced cost of failure is similarly visible.
Here's MPMC and SPMC with 1 producer and 3 consumers:
Type | 1P3C | Normal ± Err | Relax ± Err |
Spmc | offersFailed | 1.5 ± 0.2 | 125.9 ± 4.6 |
Spmc | offersMade | 13.4 ± 0.1 | 14.0 ± 0.0 |
Spmc | pollsFailed | 0.0 ± 0.0 | 0.0 ± 0.0 |
Spmc | pollsMade | 13.1 ± 0.1 | 13.7 ± 0.0 |
Mpmc | offersFailed | 23.9 ± 0.8 | 124.1 ± 3.2 |
Mpmc | offersMade | 11.0 ± 0.0 | 12.1 ± 0.0 |
Mpmc | pollsFailed | 0.0 ± 0.2 | 1.0 ± 1.6 |
Mpmc | pollsMade | 10.7 ± 0.0 | 11.9 ± 0.0 |
- The SPMC case shows a 5% improvement and a great improvement to failure performance.
- The MPMC case improves by 11%, and the improved cost of failure is similarly visible.
Finally, here's MPMC with 2 producers and 2 consumers:
Type | 2P2C | Normal ± Err | Relax ± Err |
Mpmc | offersFailed | 1.4 ± 0.8 | 0.7 ± 0.6 |
Mpmc | offersMade | 14.8 ± 0.8 | 16.0 ± 0.6 |
Mpmc | pollsFailed | 0.0 ± 0.0 | 0.4 ± 0.5 |
Mpmc | pollsMade | 14.7 ± 0.8 | 15.9 ± 0.6 |
- The improvement to throughput is a nice 8%, but the error is not that far from it, so I'm not too confident without many more runs. This is a symmetric load, similar to the 1P1C case, and the balance is showing through.
To summarize the throughput results:
- Throughput is generally improved
- Failure to make progress is significantly cheaper
Latency Impact Of Relaxation
Is this going to help us with burst handling latency? There's a benchmark for that too! Or rather, there's an estimate of burst processing cost. Code is here. The general idea is to send a burst of messages from the producer, then wait for a completion notice from the consumer, as sketched below.
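As a rough illustration, here's a self-contained sketch of that idea (the real benchmark is a JMH benchmark and is parametrized over all the queue types; the class name, queue type, and constants here are made up):

```java
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;
import org.jctools.queues.MpscArrayQueue;

// Sketch of the burst cost estimate: the producer sends BURST_SIZE messages,
// then spins until the consumer signals that the whole burst was received.
// Swapping q.offer/q.poll for relaxedOffer/relaxedPoll gives the 'Relax' runs.
public class BurstCostSketch {
    static final int BURST_SIZE = 100; // illustrative; the table uses 1/10/100

    public static void main(String[] args) {
        Queue<Integer> q = new MpscArrayQueue<>(128 * 1024);
        AtomicBoolean burstDone = new AtomicBoolean();

        Thread consumer = new Thread(() -> {
            int received = 0;
            while (true) {
                if (q.poll() != null && ++received == BURST_SIZE) {
                    received = 0;
                    burstDone.lazySet(true); // completion notice to the producer
                }
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        for (int run = 0; run < 10; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < BURST_SIZE; i++) {
                while (!q.offer(1)) ; // spin until the burst is fully sent
            }
            while (!burstDone.get()) ; // wait for the completion notice
            burstDone.lazySet(false);
            System.out.println("burst cost (ns): " + (System.nanoTime() - start));
        }
    }
}
```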
Here goes:
Size | Type | Normal ± Err | Relax ± Err |
1 | Spsc | 126 ± 2 | 126 ± 1 |
1 | Mpsc | 144 ± 2 | 140 ± 1 |
1 | Spmc | 187 ± 2 | 189 ± 2 |
1 | Mpmc | 190 ± 8 | 203 ± 2 |
10 | Spsc | 193 ± 2 | 192 ± 3 |
10 | Mpsc | 611 ± 22 | 429 ± 7 |
10 | Spmc | 309 ± 3 | 307 ± 1 |
10 | Mpmc | 807 ± 19 | 677 ± 15 |
100 | Spsc | 530 ± 7 | 528 ± 4 |
100 | Mpsc | 5243 ± 158 | 3023 ± 107 |
100 | Spmc | 1240 ± 7 | 1250 ± 6 |
100 | Mpmc | 4684 ± 382 | 3665 ± 111 |
- Same as before, SPSC serves as a sanity check/baseline. No change is good.
- SPMC/MPSC results are pretty similar when sending a single message. MPMC seems slightly regressed.
- When sending 10/100 messages in a burst we see significant benefits to using the relaxed methods, in particular for the MPSC and MPMC cases.
Summary
Relaxing is good for you, deep down you always knew it. Turns out it ain't half bad for your queues either :-). The new API will be part of the next version of JCTools, coming out as soon as I can work through my Maven mojo phobia.
Thanks J.P. Bempel and Phil Aston for their kind review, remaining errors are a result of me refusing to listen to them persnickety bastards.