Monday 19 January 2015

MPMC: The Multi Multi Queue vs. CLQ

{This post is part of a long-running series on lock-free queues, check out the full index to get more context here}
JCTools, which is my spandex side project for lock-free queues and other animals, contains a lovely gem of a queue called the MpmcArrayQueue. It is a port of an algorithm put forward by D. Vyukov (the lock-free ninja), which I briefly discussed in a previous post on ring-buffer queues.
The implementation is covered to a large extent in that post, but no performance numbers are provided to evaluate this queue vs. the JDK's own MPMC queue, the illustrious ConcurrentLinkedQueue (CLQ for short). Let's fix that!

Welcome to the Machine

Before we start the party, we must establish the test bed for these experiments. The results will only be relevant for similar setups and will become less indicative the more we deviate from this environment. So here it is, the quiet performance testing machine (shitloads of power and nothing else running):
  • OS: CentOS 6.4
  • CPU: Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz, dual socket, each CPU has 12 cores (24 logical cores as HT is on). This is an Ivy Bridge, so decent hardware if not the fanciest/latest.
  • JVM: Oracle JDK 1.8u25
  • Load average (idle): 0.00, 0.02, 0.00 -> it's oh so quiet....
I'll be using taskset to pin threads to a given topology of producers/consumers, which is important because:
  • Running 2 logical threads on the same physical core will share the queue in the L1 cache.
  • Running on 2 physical cores on the same socket will share the queue in the LLC.
  • Running on different sockets (i.e. 2 different CPUs) will force sharing of data via QPI.
  • Each topology has its own profile, so there's no point in averaging the lot together.
For the sake of these benchmarks I'll be pinning the producer/consumer threads so they never share a physical core but remain on the same CPU socket. On my particular hardware this means I'll use the 12 physical cores on one CPU via taskset -c 0-11.
To keep things simple I ran the benchmarks several times and averaged the results; no fancy graphs (still recovering from the new year). Throughput results are quoted in ops per microsecond; if you are more comfortable with ops per second, just multiply by 1 million. Latency results are in nanoseconds per operation.

Throughput Benchmarks: Hand rolled 1 to 1

Let's start with the simplest benchmark in JCTools. The QueuePerfTest is really a single producer/consumer test:
  • A producer thread is spun up which feeds into the queue as fast as it can.
  • The main thread plays the consumer and empties the queue as fast as it can.
The original benchmark was a sample used by Martin Thompson in one of his talks, which I tweaked further. It's nice and simple and has the added benefit of verifying that all the elements arrive at their destination. It also opens the door to loop unrolling, which you may or may not consider a reasonable use case. This benchmark calls Thread.yield() if an offer/poll call fails, which represents a rather forgiving backoff policy; this is why we also have the BusyQueuePerfTest, which busy-spins on failure.
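The gist of it, as a hypothetical condensed sketch (names are made up; the real QueuePerfTest also verifies the delivered values and reports per-iteration results, which is omitted here):

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public class HandRolledTptSketch {
        private static final int REPETITIONS = 10 * 1000 * 1000;
        private static final Integer TEST_VALUE = 777;

        public static void main(String[] args) throws InterruptedException {
            final Queue<Integer> q = new ConcurrentLinkedQueue<>(); // or a JCTools queue
            Thread producer = new Thread(() -> {
                for (int i = 0; i < REPETITIONS; i++) {
                    while (!q.offer(TEST_VALUE)) {
                        Thread.yield(); // forgiving backoff; the Busy variant spins instead
                    }
                }
            });
            long start = System.nanoTime();
            producer.start();
            // the main thread plays the consumer
            for (int i = 0; i < REPETITIONS; i++) {
                while (q.poll() == null) {
                    Thread.yield();
                }
            }
            producer.join();
            long duration = System.nanoTime() - start;
            System.out.println("ops/µs: " + (double) REPETITIONS * 1000 / duration);
        }
    }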
Here's what we get:
Queue  Benchmark            Throughput (ops/µs)
CLQ    QueuePerfTest        10.7
Mpmc   QueuePerfTest        65.1
CLQ    BusyQueuePerfTest    15.3
Mpmc   BusyQueuePerfTest    63.5
         
The observed throughput for CLQ varies quite a bit between iterations, but the full-run average is reasonably stable. The performance also depends on the size of the heap: CLQ allocates a node per item passed through the queue, so GC overhead can become a significant part of the cost. In my runs I set the heap size to 1GB, and the benchmark also calls System.gc() between measurement iterations. In any case, we're off to a good start; the JCTools queue is looking good.

Throughput Benchmarks: JMH Joy

I'm a big JMH fan (see the JMH reference page), and these benchmarks show some of the power JMH has. The benchmark code is very simple and it's clear what's being measured:
  • We have 2 methods, offer and poll
  • Thread groups hammer each method
  • We use @AuxCounters to report successful/failed interactions with the queue
  • The throughput is the number of pollsMade, i.e. the number of items delivered (see the sketch after this list)
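To make that concrete, here is a minimal sketch of such a grouped benchmark. It is not the actual JCTools benchmark code, just an illustration of the shape: the class name is made up and the queue choice is hard-coded rather than driven by the qType/qCapacity parameters:

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    import org.openjdk.jmh.annotations.*;

    @State(Scope.Group)
    public class QueueThroughputSketch {
        private static final Integer ONE = 777;
        private Queue<Integer> q;

        @Setup
        public void init() {
            // the real benchmark picks the queue via the qType/qCapacity params
            q = new ConcurrentLinkedQueue<>();
        }

        // per-thread counters; JMH reports them alongside the raw throughput
        @AuxCounters
        @State(Scope.Thread)
        public static class OfferCounters {
            public long offersMade, offersFailed;
        }

        @AuxCounters
        @State(Scope.Thread)
        public static class PollCounters {
            public long pollsMade, pollsFailed;
        }

        @Benchmark
        @Group("tpt")
        public void offer(OfferCounters c) {
            if (q.offer(ONE)) c.offersMade++;
            else c.offersFailed++; // a bounded queue may be full
        }

        @Benchmark
        @Group("tpt")
        public void poll(PollCounters c) {
            if (q.poll() != null) c.pollsMade++;
            else c.pollsFailed++; // the queue may be empty
        }
    }

Running this with -tg 2,2 gives the 2P2C scenario: 2 threads hammering offer and 2 hammering poll, with pollsMade as the delivered-items throughput.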
The nice thing is that once I have this setup I can play with the number of threads in each group with no further investment (see an introduction to multi-threaded benchmarks with JMH). I can run several iterations/forks and go make some coffee while JMH crunches through the lot and gives me a nice report. The JMH options used for these benchmarks were "-gc -f 5 -i 5 -wi 5 -p qType=MpmcArrayQueue,ConcurrentLinkedQueue -p qCapacity=1000000"; the -gc option forces a GC cycle between iterations as before, and I use the -tg option to control the number of producer/consumer threads. Here are the results for QueueThroughputBackoffNone (again, results are in ops/µs, so higher is better):
Queue  1P1C        2P2C      3P3C      4P4C      5P5C      6P6C
CLQ     7.9 ±1.7 , 4.7 ±0.2, 3.8 ±0.1, 3.7 ±0.4, 3.1 ±0.1, 2.9 ±0.1
Mpmc   68.4 ±11.1, 9.2 ±1.1, 7.3 ±0.5, 5.8 ±0.4, 5.2 ±0.4, 5.3 ±0.2

Note that the columns stand for the number of producers/consumers, so 1P1C runs with -tg 1,1 and has one thread hitting offer and one hitting poll; 2P2C has 2 producers and 2 consumers, etc.
So on the one hand, joy: MPMC is still consistently ahead. On the other hand, it is not some sort of magic cure for contention; if you have multiple threads hammering a shared resource, you will pay for it. The initial hit (going from 1P1C to 2P2C) is the worst, but the degradation curve after that is gentler, and MPMC stays ahead throughout.

Average Round Trip Time Benchmarks

This benchmark was set up to model a bursty load on queue implementations: we measure the time it takes for a burst of messages to travel from an originating thread, through a chain of threads inter-linked by queues, and back to the same thread. The benchmark focuses on near-empty queues and the notification/wakeup time from when an item is placed in a queue until it becomes visible to another thread. It also highlights the impact of any batching optimisations as bursts grow larger.
The benchmark has been discussed in some detail in 2 previous posts (1, 2).
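To give a flavour of what's being measured, here's a hypothetical stripped-down version of the chain length 2 case: no JMH machinery, no warmup, made-up names, just the shape of the burst/echo exchange:

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public class BurstRttSketch {
        public static void main(String[] args) throws InterruptedException {
            final Queue<Integer> ping = new ConcurrentLinkedQueue<>();
            final Queue<Integer> pong = new ConcurrentLinkedQueue<>();
            final int burst = 100, rounds = 100_000;

            // the 'echo' thread is link 2 in the ring: 1->2->1
            Thread echo = new Thread(() -> {
                for (int i = 0; i < burst * rounds; i++) {
                    Integer e;
                    while ((e = ping.poll()) == null) { } // spin on empty
                    pong.offer(e);
                }
            });
            echo.start();

            long start = System.nanoTime();
            for (int r = 0; r < rounds; r++) {
                for (int i = 0; i < burst; i++) ping.offer(1);   // send the burst
                for (int i = 0; i < burst; i++) {
                    while (pong.poll() == null) { }              // wait for it to come back
                }
            }
            long avgRtt = (System.nanoTime() - start) / rounds;
            echo.join();
            System.out.println("avg burst RTT (ns): " + avgRtt);
        }
    }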
We have 2 axes we can explore here, burst size and ring size, and I have not explored all the possibilities; the results seem quite straightforward and I'm short on time. I tested with burst size = 1, 10, 100 in the default configuration, i.e. chain length 2 (a ping-pong kind of setup), and just for the heck of it I ran burst size = 100 with a chain length of 8 (so 1->2->3...->7->8->1). There you go (results are now in nanoseconds per operation, lower is better):
Queue  b=1,c=2   b=10,c=2   b=100,c=2    b=100,c=8
CLQ    488 ±11,  2230 ±98,  15755 ±601,  24886 ±2565
Mpmc   422 ±7,   1718 ±51,   5144 ±287,   8714 ±200

Note that the header shorthand stands for burst size (b) and chain length (c), so b=1,c=2, the default config for the benchmark, stands for a burst size of 1 (send 1 message and wait until 1 message is received back) and a chain length of 2 (2 threads linked up in a ring: 1->2->1).
The difference on a single element exchange is not that great, but as burst size grows the gap widens significantly. This is perhaps down to the fact that MPMC enjoys a much denser delivery of data, minimising the cache misses experienced as part of the message exchange. Note that extending the length of the chain seems to add the same relative overhead for each implementation, resulting in the same ratio for chain length 8 as we saw for chain length 2 (MPMC is roughly 3 times 'cheaper' than CLQ).

Summary

This was all rather dry, but I hope it helps people place the MPMC alternative in context. I would suggest you consider using MpmcArrayQueue in your application as a replacement for CLQ if:
  • You need a bounded queue
  • You are concerned about latency
  • You don't need to use the full scope of methods offered by Queue and can make do with the limited set supported in JCTools queues
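To make the swap concrete, here's a minimal usage sketch (the capacity and element type are placeholders; the key behavioural difference from CLQ is that offer can fail when the bounded queue is full):

    import java.util.Queue;
    import org.jctools.queues.MpmcArrayQueue;

    public class MpmcUsageSketch {
        public static void main(String[] args) {
            // bounded: capacity is fixed at construction (placeholder value)
            Queue<String> q = new MpmcArrayQueue<>(1024);

            // unlike CLQ, offer returns false when the queue is full
            if (!q.offer("job-1")) {
                // apply your backoff/rejection policy here
            }

            // poll returns null when the queue is empty, same as CLQ
            String job = q.poll();
            System.out.println(job);
        }
    }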