Showing posts with label IPC. Show all posts
Showing posts with label IPC. Show all posts

Monday, 1 July 2013

A Java Ping Buffet

Buffet pings
When considering latency/response time in the context of client/server interactions I find it useful to measure the baseline, or no-op, round trip between them. This give me a real starting point to appreciate the latency 'budget' available to me. This blog post presents an open source project offering several flavours of this baseline measurement for server connectivity latency.

How did it come about?

I put the first variation together a while back to help review baseline assumptions on TCP response time for Java applications. I figured it would be helpful to have a baseline latency number of the network/implementation stack up to the application and back. This number may seem meaningless to some, as the application does nothing much, but it serves as a boundary, a ballpark figure. If you ever did a course about networking and the layers between your application and the actual wire (the network stack), you can appreciate this measurement covers a round trip from the application layer and back.
This utility turned out to be quite useful (both to myself and a few colleagues) in the past few months. I tinkered with it some more, added another flavour of ping, and another. And there you have it, a whole bloody buffet of them. The project now implements the same baseline measurement for:
  • TCP
    • Busy spin on non-blocking sockets
    • Selector busy spin on selectNow
    • Selector blocking on select
    • Blocking sockets
    • Old IO sockets are not covered(maybe later)
  • UDP
    • Busy spin on non-blocking sockets
    • All the other cases covered for TCP are not covered(maybe later)
  • IPC via memory mapped file
This code is not entirely uniform and I beg your forgiveness (and welcome you criticism) if it offends your sensibilities. The aim was for simplicity and little enough code that it needs little in the way of explaining. All it does is ping, and measure. All measurements are in nanoseconds (Thanks Ruslan for pointing out the omission).
The original TCP spinning client/server code was taken from one of Peter Lawrey's examples, but it has been mutilated plenty since, so it's not really his fault if you don't like it. I also had great feedback and even some code contribution from Darach Ennis. Many thanks to both.
My mother has a wicked back hand

Taking the code for a drive

Imagine that you got some kit you want to measure Java baseline network performance on. The reality of these things is that the performance is going to vary for JVM/OS/Hardware and tuning for any and all of the ingredients. So off you go building a java-ping.zip (ant dist), you copy it onto your server/servers of choice and unzip (unzip java-ping.zip -d moose). You'll find the zip is fairly barebones and contains some scripts and a jar. You'll need to make the scripts runnable (chmod a+x *.sh). Now assuming you have Java installed you can start the server:
$ ./tcp-server.sh spin &
And then the client:
$ ./tcp-client.sh spin 

And in lovely CSV format the stats will pour into your terminal, making you look busy.
Min,50%,90%,99%,99.9%,99.99%,Max
6210,6788,9937,20080,23499,2189710,46046305
6259,6803,7464,8571,10662,17259,85020
6275,6825,7445,8520,10381,16981,36716
6274,6785,7378,8539,10396,16322,19694
6209,6752,7336,8458,10381,16966,55930
6272,6765,7309,8521,10391,15288,6156039
6216,6775,7382,8520,10385,15466,108835
6260,6756,7266,8508,10456,17953,63773
Using the above as a metric you can fiddle with any and all the variables available to you and compare before/after/this/that configuration.
In the previous post on this utility I covered the variance you can observe for taskset versus roaming processes, so I won't bore you with it again. All the results below were acquired while using taskset. For IPC you'll get better results when pinning to the same core (different threads) but worse tail. For TCP/UDP the best results I observed were across different cores on same socket. If you are running across 2 machines then ignore the above and pin as makes sense to you (on NUMA hardware the NIC can be aligned to a particular socket, have fun).
The tool allows for further tweaking of weather or not it will yield when busy-spinning (-Dyield=true) and adding a wait between pings (-DwaitNanos=1000). These are provided to give you a flavour of what can happen as you relax the hot loops into something closer to a back-off strategy, and as you let the client/server 'drift in their attention'.

Observing the results for the different flavours

The keen observer will notice that average latency is not reported. Average latency is not latency. Average latency is just TimeUnit/throughput. If you have a latency SLA you should know that. An average is a completely inappropriate tool for measuring latency. Take for example the case where half your requests take 0.0001 millis and half take 99.9999 millis, how is the average latency of 50 millis useful to you? Gil Tene has a long presentation on the topic which is worth a watch if the above argument is completely foreign to you.
The results are a range of percentiles, it's easy enough to add further analysis as all the observed latencies are recorded (all numbers are in nanoseconds). I considered using a histogram implementation (like the one in the Disruptor, or HdrHistogram) but decided it was better to stick to the raw data for something this small and focused. This way no precision is lost at the cost of a slightly larger memory footprint. This is not necessarily appropriate for every use case.
Having said all that, here is a sample of the results for running the code on semi-respectable hardware (all runs are pinned using taskset, all on default settings, all numbers are in nanoseconds):
Implementation, Min,   50%,   90%,   99%,   99.9%, 99.99%,Max
IPC busy-spin,  89,    127,   168,   3326,  6501,  11555, 25131
UDP busy-spin,  4597,  5224,  5391,  5958,  8466,  10918, 18396
TCP busy-spin,  6244,  6784,  7475,  8697,  11070, 16791, 27265
TCP select-now, 8858,  9617,  9845,  12173, 13845, 19417, 26171
TCP block,      10696, 13103, 13299, 14428, 15629, 20373, 32149
TCP select,     13425, 15426, 15743, 18035, 20719, 24793, 37877

Bear in mind that this is RTT(Round Trip Time) so a request-response timing. The above measurement are also over loopback, so no actual network hop. The network hop on 2 machines hooked into each other via a network cable will be similar, anything beyond that and your actual network stack will become more and more significant. Nothing can cure geography ;-)
I am sure there are further tweaks to make in the stack to improve the results. Maybe the code, maybe the OS tuning, maybe the JVM version. It doesn't matter. The point is you can take this and measure your stack. The numbers may differ, but the relative performance should be fairly similar.

Is it lunch time?

This is a bit of a detour, but bear with me. On the IPC side of things we should also start asking ourselves: what is the System.nanotime() measurement error? what sort of accuracy can we expect?
I added an ErrPingClient which runs the test loop with no actual ping logic, the result:
Min, 50%, 90%, 99%, 99.9%, 99.99%,Max
38,  50,  55,  56,  59,    80,    8919

Is this due to JVM hiccups? inherent inaccuracy of the underlying measurement method used by the JVM? in this sort of time scale the latency measurement becomes a problem onto itself and we have to revert to counting on (horrors!) average latency over a set of measurements to cancel out the inaccuracy. To quote the Hitchhikers Guide: "Time is an illusion, and lunch time doubly so", we are not going to get exact timings at this resolution, so we will need to deal with error. Dealing with this error is not something the code does for you, just be aware some error is to be expected.

What is it good for?

My aim with this tool (if you can call it that) was to uncover baseline costs of network operations on a particular setup. This is a handy figure to have when judging the overhead introduced by a framework/API. No framework in the world could beat a bare bones implementation using the same ingredients, but knowing the difference educates our shopping decisions. For example, if your 'budget' for response time is low the overhead introduced by the framework of your choice might not be appropriate. If the overhead is very high perhaps there is a bug in the framework or how you use it.
As the tool is deployable you can also use it to validate the setup/configuration and use that data to help expose issues independent of your software.
Finally, it is a good tool to help people who have grown to expect Java server applications response time to be in the tens of milliseconds range wake up and smell the scorching speed of today's hardware :-) 

Sunday, 7 April 2013

135 Million messages a second between processes in pure Java

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
Porting an existing single producer/single consumer concurrent queue into an IPC mechanism via memory mapped files and getting 135 million messages throughput in pure Java.
In my previous post I covered a single producer/consumer queue developed and shared by Martin Thompson capable of delivering an amazing 130M messages per second. The queue he delivered is a great tool for communicating between threads, but sometimes communicating between threads is not enough. Sometime you need to leave your JVM and go out of process. Inter Process Communications (IPC) is a different problem to inter thread communications, can it be cracked by the same approach?

IPC, what's the problem?

Inter Process Communication is an old problem and there are many ways to solve it (which I will not discuss here). There are several attractions to specialized IPC solutions for Java:
  • Faster than socket communication.
  • An out of process integration option with applications written in other languages.
  • A means of splitting large VMs to smaller ones improving performance by allowing GC and JIT specialization.
For IPC to be attractive it has to be fast, otherwise you may as well go for network based solutions which would extend beyond your local machine uniformly. I attended an Informatica conference a while back and got talking to Todd Montgomerey about the Disruptor and mechanical sympathy, he suggested that IPC should be able to perform as well as inter thread messaging. I found the idea interesting and originally meant to port the Disruptor, but Martin's queue is simpler (and quicker) so I went for that instead. Starting with a good algorithm/data structure is very good indeed, now I just needed to bridge the gap and see if I can maintain the benefits.

 

Off the heap we go!

To do IPC we must go off heap. This has several implications for the queue, most importantly references are not supported. Also note persistence to and from the queue is required, though one could extend my implementation to support a zero copy interaction where a struct is acquired, written and committed instead of the offer method, and similarly acquired, read and finally released instead of the poll method. I plan to make several flavours of this queue to test out these ideas in the near future.
My IPC queue uses a memory mapped file as a means of acquiring a chunk of shared memory, there is no intention to use the persisted values though further development in that direction may prove interesting to some. So now that I got me some shared memory, I had to put the queue in it.
I started by laying out the queue counters and cached counters. After realizing the counters need to be aligned to work properly I learnt how to align memory in Java. I went on to verify that aligned memory offers the guarantees required for concurrent access. Quick summary:
  • aligned access means writing data types to addresses which divide by their size.
  • unaligned access is not atomic, which is bad for concurrency :(
  • unaligned access is slow, which is bad for performance :(
  • unaligned access may not work, depending on OS and architecture. Not working is very bad :(
Sorting out alignment is not such a big deal once you know how it works. One of the nice things about going off-heap was that solving false sharing has become far more straightforward. Move your pointer and you're in the next cache line, job done. This left me rather frustrated me with the tricks required to control memory layout in Java. Going back to the original implementation you will notice the Padded classes who's role it is to offer false sharing protection. They are glorious hacks (with all due respect) made necessary by this lack of control. The @Contended annotation coming in JDK 8 will hopefully remove the need for this.
This is how the memory layout worked out:


To illustrate in glorious ASCII graphics (each - is a byte), this is what the memory layout looks like when broken into cache lines:
|--------|--------|--------|head....|--------|--------|--------|--------|
|--------|--------|--------|tailCach|--------|--------|--------|--------|
|--------|--------|--------|tail----|--------|--------|--------|--------|
|--------|--------|--------|headCach|--------|--------|--------|--------|
|int1int2|int3int4|int5int6|etcetcet|cetcetce|tcetcetc|etcetcet|cetcetce|
...



I played around with mixing off heap counters with on heap buffer but in the interest of brevity I'll summarize and say the JVM does not like that very much and the end result performance is not as good as all heap/off-heap solutions. The code is available with everything else.
Once alignment and memory layout were sorted I had to give up the flexibility of having reference pointers and settle for writing my data (an integer) directly into the memory. This leaves my queue very restrictive in it's current form. I intend to revisit it and see what I can do to offer a more extendable API on top of it.
Let me summarize the recipe at this point:
  • Create a memory mapped file large enough to hold:
    • 4 cache lines for counters/cached counters.
    • 4 bytes(per integer) * queue capacity (must be a power of 2).
    • 1 spare cache line to ensure you can align the above to the cache line.
  • Get a mapped byte buffer, which is a direct byte buffer on top of the mapped memory.
  • Steal the address and get the contained aligned byte buffer.
  • Setup pointers to the counters and the beginning of the buffer
  • Replace use of natural counters with off heap counters accessed via Unsafe using the pointers.
  • Replace use of array with use of offset pointers into buffer and Unsafe access.
  • Test and debug until you work out the kinks...
The above code should give you a fair idea how it works out and the rest is here. This queue can work in process and out of process as demonstrated in the tests included in the repository. Now that it works (for the limited use case, and with room for further improvement... but works), is it fast enough? not so fast? is it...<gasp> ... FASTER????!?!?!

Smithers, release the hounds

Here are the numbers for using the different implementations in process:

Implementation/AffinitySame coreCross coreCross socket
P1C1QueueOriginal3110M130M19M
P1C1OffHeapQueue130M220M200M
P1C1QueueOriginalPrimitive124M220M215M


Confused? Let me explain. First line is the measurements taken for the original queue. Similar to what was presented in prev. post, though I saw a slight improvement in the results with increasing the compile threshold to 100000.
The second line is my offheap implementation of same algorithm. It is significantly faster. This is not IPC yet, this is in process. The reason it is faster is because data is inlined in the queue, which means that by loading an entry in the queue we get the data as opposed to a reference to the data. Getting a reference is what you get when you have and Object[] array. The array holds the references and the data is elsewhere, this seems to make it more painful as we get further from the producer.
The last entry is a mutation of P1C1QueueOriginal3 into a primitive array backed queue to compare performance like for like. As you can see this displays very similar results to the off heap implementation supporting the theory that data in-lining is behind the observed performance boost.
The lesson here is an old one, namely that pointer chasing is expensive business further amplified by the distance between the producing CPU and consuming CPU.
The off-heap queue can offer an alternative to native code integration as the consuming thread may interact directly with the off-heap queue and write results back to a different off-heap queue.
Running a similar benchmark adapted to use a memory mapped file as the backing DirectByteBuffer for the off-heap queue we get:
    same core      - ops/sec=135M
    across cores   - ops/sec=98M
    across sockets - ops/sec=25M



JOY! a pure Java IPC that gives you 135M messages per second is more throughput then you'd get with most commercial products out there. This is still not as fast as the same queue in process and I admit I'm not sure what the source of the performance difference is. Still I am quite happy with it.
A few notes/observations from the experimentation process:
  1. I got a variety of results, stabilizing around different average throughputs. I chose the best for the above summary and plan to go into detail about the results in the near future.
  2. The JVM was launched with: -XX:+UseCondCardMark -XX:CompileThreshold=100000
  3. Removing the Thread.yield from the producer/consumer loops improved performance when running on the same core, but made it worse otherwise.
  4. Moving the queue allocation into the test loop changes the performance profile dramatically.
  5. I've not had time to fully explore the size of the queue as a variable in the experiment but the little I've done suggests it makes a difference, choose the right size for your application.
I realize this post is rather less accessible than the previous one, so if you have any questions please ask.