Pages

Thursday, 16 January 2014

JAQ: Using JMH to Benchmark SPSC Queues Latency - Part II

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
{If you come here looking for JMH related content start at the new and improved JMH Resources Page and branch out from there!}
Just quickly sharing more results/insights from running different configuration of the SPSC latency benchmark discussed in the previous post. The previous post reviewed the different implementations behaviour when sending different sizes of bursts, in this post I'll have a look at the impact of the length of the chain (number of threads passing the messages from one to the other returning to the originator) on latency. The benchmarks and queues stay the same, so you may have to skip back to the previous post for context.

TL;DR

This post may not be of great popular appeal... it's a verification of the behaviour of current implementations of SPSC in terms of single threaded operation costs and RTT latency for different 'trip' lengths. The main finding of interest (for me) is that the sparse data method previously discussed here is proving to be a hindrance to latency performance. This finding is discussed briefly at the end and is something I may dig into more deeply in the near future. Other more minor nuggets were found... but anyhow, lets get to it. 

Chain length 1: The cost of talking to yourself

When I had Darach review the original post he pointed out to me that I neglected to cover this edge case in which a queue is being used by a single thread acting as both the consumer and the producer. The result is interesting as a baseline of operation costs when no concurrency is involved. The benchmark is very simple:


The results give us a sense of the cost of [(offer * burst size) + (poll * burst size)] for the different implementations. I added one of my favourite collections, the ArrayDequeue, to compare the efficiency of a non-thread safe queue implementation with my furry gremlins. Here's the score for AD vs CLQ for the different burst sizes:

This is a bit of a bland presentation, so feel free to call me names (you bland representor of numbers!). The numbers are here if you want to play. All the numbers are in nanos per op, mean error in brackets. Benchmark run on laptop(Ubuntu13.10/JDK7u45/i7-4700MQ@2.4 capped). Here's the score for CAQ/FF2 with sparse shift set to 0 and 2:


Points for consideration:
  • CLQ (ConcurrentLinkedQueue) is the worst. This is completely the wrong tool for the job which is why this result should be a lesson in why using concurrent data structures where you don't need them will suck.
  • Overall AD(ArrayDequeue) is the best from burst size 4 and up. This is the right tool for the job when your consumer and producer are the same thread. Choose the right tool, get the right result.
  • CAQ(ConcurrentArrayQueue) and FF2(FFBufferWithOfferBatch) start off as similar or better than AD, but soon fall behind. Interestingly FF2 starts as the winner of the 3 and ends as the loser with CAQ overtaking it from burst size 64 and onward.

I found it interesting to see how the behaviour changed as the burst size changes, and I thought the single threaded cost estimates were valuable as enablers of other estimates, but overall this is not a benchmark to lose sleep over.
Note the CAQ/FF2 are measured in 2 configurations with and without sparse data. The use of sparse data is counter productive in this use case as it aims to reduce contention that is not there, and inhibits throughput in the process.

Stable Chains: Working across the cores

Due to the limits of my benchmarking machine I can only demonstrate stable behaviour for chains of length 1 to 4 without resorting to pinning threads to cores. Still, the results offer some insight. The benchmark in use is the one discussed previously here. The whole process is pinned such that the only available inter-thread channel is across cores. On my machine I have a hyper-threaded quad-core CPU so 8 logical cores from a taskset perspective. The even/odd numbers are on different cores. For my runs I pinned all other processes to core 0 and pinned the JMH process to different cores (as many as I needed from 1,3,5,7). This is not ideal, but is workable.
I didn't bother with CLQ for this set of results. Just CAQ and FF2. The benchmark code is here.
Here's chain length 2 (T1 -> T2 -> T1):


Here's chain length 3 (T1 -> T2 -> T3 -> T1):

Here's chain length 4 (T1 -> T2 -> T3 -> T4 -> T1):


This is data collected from (10 warmup iterations + 10 iterations) * 10 forks, so benchmarking took a while. This should mean the data is slightly more than anecdotal...
Conclusions:

  • FF2 is consistent proving to be the best queue in terms of inter-thread latency. This advantage is maintained for all burst sizes and chain lengths.
  • Sparse data is only marginally improving results consistently for burst size 1. After that it seems to have a small positive effect on CAQ with chain length 3/4 with small bursts. This may be a quirk of the use case demonstrated by the benchmark but it at least shows this method has a demonstrable down side.
  • The burst RTT does not grow linearly with the burst size, in particular for the smaller bursts. This is due to the bursts filling up the time-to-notify period with write activity.
Here's the latency for FF2(sparse=0) across all chains:

  • Single threaded cost grows linearly with size, this is how we are used to think about cost. It's also a lower boundary for the round trip use case as the round trip requires at least the amount of work done in the single thread case to still be done in the measuring/source thread.
  • Once we increase the chain length we get this initial plateau steadily increasing in slope to become linear again. Note that for the 2 threaded case the costs are nearly the same as the single thread case from burst size 512.

Where to next?

I've had a long break over Xmas, some progress was made on JAQ, but not as much as I'd have liked... I actually spent most of the time away from a keyboard. There are now MPSC/SPMC implementations, and a direct ConcurrentQueue implementation for SPSC hooked into the factory and an MPSC one nearing completion. I've had some interest from different people and am working towards meeting their requirements/needs... we'll get there eventually :-)

Thanks Norman M. for his review (he did tell me to add graphs... my bad)

No comments:

Post a Comment

Note: only a member of this blog may post a comment.