Monday 1 July 2013

A Java Ping Buffet

Buffet pings
When considering latency/response time in the context of client/server interactions I find it useful to measure the baseline, or no-op, round trip between them. This gives me a real starting point for appreciating the latency 'budget' available to me. This blog post presents an open source project offering several flavours of this baseline measurement for server connectivity latency.

How did it come about?

I put the first variation together a while back to help review baseline assumptions on TCP response time for Java applications. I figured it would be helpful to have a baseline latency number for the network/implementation stack, from the application and back. This number may seem meaningless to some, as the application does nothing much, but it serves as a boundary, a ballpark figure. If you ever took a course on networking and the layers between your application and the actual wire (the network stack), you can appreciate that this measurement covers a round trip from the application layer and back.
This utility turned out to be quite useful (both to me and a few colleagues) over the past few months. I tinkered with it some more, added another flavour of ping, and another. And there you have it, a whole bloody buffet of them. The project now implements the same baseline measurement for:
  • TCP
    • Busy spin on non-blocking sockets
    • Selector busy spin on selectNow
    • Selector blocking on select
    • Blocking sockets
    • Old IO sockets are not covered (maybe later)
  • UDP
    • Busy spin on non-blocking sockets
    • All the other cases covered for TCP are not covered (maybe later)
  • IPC via memory mapped file
This code is not entirely uniform and I beg your forgiveness (and welcome your criticism) if it offends your sensibilities. The aim was simplicity, with little enough code that it needs little in the way of explanation. All it does is ping, and measure. All measurements are in nanoseconds (Thanks Ruslan for pointing out the omission).
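To make the 'ping, and measure' bit concrete, here is a minimal sketch of what each flavour boils down to. This is illustrative only, not the project's actual code; the host, port and message size are made up. The client sends a small message, waits for the server to echo it back, and records the round trip with System.nanoTime():

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.util.Arrays;

// Illustrative sketch: a blocking TCP ping client. The project's flavours differ
// in how they wait for the echo (busy-spin, selectNow, select, blocking), but the
// measurement is always nanoTime() around a request/response round trip.
public class PingSketch {
    public static void main(String[] args) throws IOException {
        SocketChannel channel = SocketChannel.open(new InetSocketAddress("localhost", 12345));
        ByteBuffer buffer = ByteBuffer.allocateDirect(8);
        long[] rtts = new long[100_000];
        for (int i = 0; i < rtts.length; i++) {
            long start = System.nanoTime();
            buffer.clear();
            buffer.putLong(start).flip();
            while (buffer.hasRemaining()) channel.write(buffer);  // ping
            buffer.clear();
            while (buffer.hasRemaining()) channel.read(buffer);   // wait for the echo
            rtts[i] = System.nanoTime() - start;                  // round trip, in nanoseconds
        }
        Arrays.sort(rtts);
        System.out.println("Median RTT (ns): " + rtts[rtts.length / 2]);
    }
}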
The original TCP spinning client/server code was taken from one of Peter Lawrey's examples, but it has been mutilated plenty since, so it's not really his fault if you don't like it. I also had great feedback and even some code contribution from Darach Ennis. Many thanks to both.
My mother has a wicked backhand

Taking the code for a drive

Imagine you've got some kit you want to measure baseline Java network performance on. The reality of these things is that the performance is going to vary with JVM/OS/hardware and with the tuning of any and all of those ingredients. So off you go: build java-ping.zip (ant dist), copy it onto your server(s) of choice and unzip it (unzip java-ping.zip -d moose). You'll find the zip is fairly barebones and contains some scripts and a jar. You'll need to make the scripts runnable (chmod a+x *.sh). Now, assuming you have Java installed, you can start the server:
$ ./tcp-server.sh spin &
And then the client:
$ ./tcp-client.sh spin 

And in lovely CSV format the stats will pour into your terminal, making you look busy.
Min,50%,90%,99%,99.9%,99.99%,Max
6210,6788,9937,20080,23499,2189710,46046305
6259,6803,7464,8571,10662,17259,85020
6275,6825,7445,8520,10381,16981,36716
6274,6785,7378,8539,10396,16322,19694
6209,6752,7336,8458,10381,16966,55930
6272,6765,7309,8521,10391,15288,6156039
6216,6775,7382,8520,10385,15466,108835
6260,6756,7266,8508,10456,17953,63773
Using the above as a metric you can fiddle with any and all of the variables available to you and compare before/after and this/that configurations.
In the previous post on this utility I covered the variance you can observe between taskset and roaming processes, so I won't bore you with it again. All the results below were acquired while using taskset. For IPC you'll get better results when pinning to the same core (different threads), but a worse tail. For TCP/UDP the best results I observed were across different cores on the same socket. If you are running across 2 machines then ignore the above and pin as makes sense to you (on NUMA hardware the NIC can be aligned to a particular socket, have fun).
The tool allows further tweaking of whether or not it will yield when busy-spinning (-Dyield=true) and adding a wait between pings (-DwaitNanos=1000). These are provided to give you a flavour of what can happen as you relax the hot loops into something closer to a back-off strategy, and as you let the client/server 'drift in their attention'.
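For illustration only, this is roughly how I'd expect those two knobs to relax the loop; the property names match the flags above, but the exact wiring inside the project may differ:

import java.util.concurrent.locks.LockSupport;

// Sketch of the two relaxation knobs, not the project's exact code.
public class SpinPolicy {
    static final boolean YIELD = Boolean.getBoolean("yield");      // -Dyield=true
    static final long WAIT_NANOS = Long.getLong("waitNanos", 0L);  // -DwaitNanos=1000

    // Called each time round the hot loop while waiting for a response.
    static void onSpinWait() {
        if (YIELD) {
            Thread.yield(); // give up the time slice instead of burning the core
        }
        // otherwise: pure busy-spin, lowest latency, 100% CPU on that core
    }

    // Called between pings, letting client and server 'drift in their attention'.
    static void betweenPings() {
        if (WAIT_NANOS > 0) {
            LockSupport.parkNanos(WAIT_NANOS);
        }
    }
}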

Observing the results for the different flavours

The keen observer will notice that average latency is not reported. Average latency is not latency. Average latency is just a time unit divided by throughput. If you have a latency SLA you should know that. An average is a completely inappropriate tool for measuring latency. Take for example the case where half your requests take 0.0001 millis and half take 99.9999 millis: how is the average latency of 50 millis useful to you? Gil Tene has a long presentation on the topic which is worth a watch if the above argument is completely foreign to you.
The results are reported as a range of percentiles; it's easy enough to add further analysis, as all the observed latencies are recorded (all numbers are in nanoseconds). I considered using a histogram implementation (like the one in the Disruptor, or HdrHistogram) but decided it was better to stick to the raw data for something this small and focused. This way no precision is lost, at the cost of a slightly larger memory footprint. This is not necessarily appropriate for every use case.
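As a rough sketch (not the project's actual reporting code), producing a row like the ones above from the raw observations is just a sort and a handful of index lookups:

import java.util.Arrays;

// Sketch: derive "Min,50%,90%,99%,99.9%,99.99%,Max" from raw nanosecond samples.
public class PercentileRow {
    static String summarise(long[] observedNanos) {
        long[] sorted = observedNanos.clone();
        Arrays.sort(sorted);
        StringBuilder row = new StringBuilder().append(sorted[0]); // Min
        for (double p : new double[]{0.5, 0.9, 0.99, 0.999, 0.9999}) {
            int index = Math.min(sorted.length - 1, (int) (p * sorted.length));
            row.append(',').append(sorted[index]);
        }
        return row.append(',').append(sorted[sorted.length - 1]).toString(); // Max
    }
}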
Having said all that, here is a sample of the results for running the code on semi-respectable hardware (all runs are pinned using taskset, all on default settings, all numbers are in nanoseconds):
Implementation, Min,   50%,   90%,   99%,   99.9%, 99.99%,Max
IPC busy-spin,  89,    127,   168,   3326,  6501,  11555, 25131
UDP busy-spin,  4597,  5224,  5391,  5958,  8466,  10918, 18396
TCP busy-spin,  6244,  6784,  7475,  8697,  11070, 16791, 27265
TCP select-now, 8858,  9617,  9845,  12173, 13845, 19417, 26171
TCP block,      10696, 13103, 13299, 14428, 15629, 20373, 32149
TCP select,     13425, 15426, 15743, 18035, 20719, 24793, 37877

Bear in mind that this is RTT (Round Trip Time), so a request-response timing. The above measurements are also over loopback, so no actual network hop. The picture for 2 machines hooked into each other via a network cable will be similar; anything beyond that and your actual network stack will become more and more significant. Nothing can cure geography ;-)
I am sure there are further tweaks to make in the stack to improve the results. Maybe the code, maybe the OS tuning, maybe the JVM version. It doesn't matter. The point is you can take this and measure your stack. The numbers may differ, but the relative performance should be fairly similar.

Is it lunch time?

This is a bit of a detour, but bear with me. On the IPC side of things we should also start asking ourselves: what is the System.nanoTime() measurement error? What sort of accuracy can we expect?
I added an ErrPingClient which runs the test loop with no actual ping logic, the result:
Min, 50%, 90%, 99%, 99.9%, 99.99%,Max
38,  50,  55,  56,  59,    80,    8919

Is this due to JVM hiccups? Inherent inaccuracy of the underlying measurement method used by the JVM? At this sort of time scale the latency measurement becomes a problem unto itself, and we have to resort to counting on (horrors!) average latency over a set of measurements to cancel out the inaccuracy. To quote the Hitchhiker's Guide: "Time is an illusion, and lunch time doubly so"; we are not going to get exact timings at this resolution, so we will need to deal with error. Dealing with this error is not something the code does for you, just be aware that some error is to be expected.
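If you want to see this floor for yourself, a sketch in the spirit of the ErrPingClient (not its actual source) is just back-to-back calls to nanoTime():

import java.util.Arrays;

// Sketch: time an 'empty ping' to expose the cost/error of the measurement itself.
public class NanoTimeErrorSketch {
    public static void main(String[] args) {
        long[] deltas = new long[1_000_000];
        for (int i = 0; i < deltas.length; i++) {
            long start = System.nanoTime();
            // no ping logic at all
            deltas[i] = System.nanoTime() - start;
        }
        Arrays.sort(deltas);
        System.out.println("Min: " + deltas[0]
                + ", Median: " + deltas[deltas.length / 2]
                + ", Max: " + deltas[deltas.length - 1]);
    }
}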

What is it good for?

My aim with this tool (if you can call it that) was to uncover the baseline costs of network operations on a particular setup. This is a handy figure to have when judging the overhead introduced by a framework/API. No framework in the world could beat a bare-bones implementation using the same ingredients, but knowing the difference educates our shopping decisions. For example, if your 'budget' for response time is low, the overhead introduced by the framework of your choice might not be acceptable. If the overhead is very high, perhaps there is a bug in the framework or in how you use it.
As the tool is deployable, you can also run it to validate the setup/configuration and use that data to help expose issues independent of your software.
Finally, it is a good tool to help people who have grown to expect Java server application response times to be in the tens-of-milliseconds range wake up and smell the scorching speed of today's hardware :-)

11 comments:

  1. Thanks for your blog post. A quick scan of the source suggests that you're not actually using ICMP packets. Is that correct?

    Replies
    1. This falls under: http://en.wikipedia.org/wiki/Ping_(networking_utility)#Other_types_of_pinging

  2. Correct. This is an application-level ping, not the ping application/protocol. Fair point.

  3. ...Could you print units of time with your results?

    Replies
    1. Damn. A rookie mistake there. It's all nanoseconds, I'll update the post to mention it. Would you want the utility to print it?

  4. 1. Why is the server socket closed so early in the code instead of in a finally {} block?
    2. Would it be useful to also provide a benchmark for connection latency, TCP blocking vs. TCP NIO busy-spin?

    Replies
    1. Glad you are taking interest, but it's been a while since I looked at the code... so:
      1. I can't remember why. Do you think it's a bug/issue? If so, please use the GitHub issue tracking mechanism.
      2. It might be useful. If you want this to happen I'm happy to hear more, but be aware that it may be a long while before I get to it. If you have implemented it and want to send me a pull request, please do.
      Thanks :-)

  5. I am really wondering what can cause an 11 millisecond swing when I ran this on a very powerful Linux server.

    As you can see, the average time measured on an empty ping can swing from 40 nanoseconds to 20 microseconds, so the TCP latency testing in our 10 GbE environment is meaningless, since it's well under 20 microseconds with the onload driver.

    _____________________________________________
    java-ping # taskset -c 9 ./err-client.sh
    Min,50%,90%,99%,99.9%,99.99%,Max
    @36,40,40,132,174,186,4754628
    @37,40,40,40,101,105,11009008
    @39,50,64,65,74,83,22554
    @39,50,64,65,74,99,28717
    @39,50,64,65,74,97,34797
    @40,50,64,65,68,77,26473
    @40,50,64,65,68,87,22941
    @40,50,64,65,68,75,34179
    @39,50,64,65,72,92,7754
    @39,50,64,65,73,88,10797
    @39,50,64,65,73,85,17153
    @39,50,64,65,73,83,11590
    @38,50,64,65,73,100,6792
    @40,50,64,65,68,86,507535
    @40,50,64,65,68,96,14464
    @40,50,64,65,68,89,7980
    @39,50,64,65,72,86,11057
    @38,50,64,65,73,93,26564
    @39,50,64,65,73,102,35056
    @38,50,64,65,73,83,26706
    @39,50,64,65,73,95,35276
    @39,50,64,65,69,86,28051
    @40,50,64,65,68,75,35067
    @40,50,64,65,68,86,34854
    @40,50,64,65,68,85,36261
    @39,50,64,65,74,100,6885
    @38,50,64,65,74,122,13323
    @39,50,64,65,73,135,8328
    @39,50,64,65,74,83,7784
    @39,50,64,65,73,93,7405

    Replies
    1. I don't read the data provided as particularly problematic:
      - "the average time measure on empty ping can swing from 40 nano seconds to 20 microseconds" - The output is a percentile breakdown of the latencies observed. There's is no average figure here, so not sure what you mean by this comment.
      - "what can cause 11 milliseconds swing" - This is observed in the 2 first measurement iterations. I imagine this is a JVM warmup artifact, and a normal one at that. To verify this theory you need to log JVM stopped events (-XX:+PrintGCApplicationStoppedTime).
      The data you observe is not unusual. Up to 99.99% of calls to nanoTime terminate in 40ns to 135ns. See observations on this topic from Aleksey Shipilev: https://shipilev.net/blog/2014/nanotrusting-nanotime/#_one_off_measurements
      The max you observe (7 to 507us) is in the noise level expected on a server. The noise can be reduced by configuration. You can read Mark Price posts on the topic here:
      http://epickrram.blogspot.co.za/2015/11/reducing-system-jitter-part-2.html

  6. I see all of the implementations use NIO. I compared NIO DatagramChannel vs DatagramSocket and found DatagramSocket to be slightly better on all percentiles, especially regarding outliers.
    DatagramSocket max RTT was 2.7 msec compared to 26 msec on DatagramChannel.
    Have you tested a DatagramSocket implementation?

  7. Great tool!
    I tried it together with +/-UseBiasedLocking. I used the TCP server/client spin flavour, which has no contention between threads, so I expected +UseBiasedLocking to show some improvement, as the JDK socket writing and reading involves lots of synchronized blocks (e.g. synchronized (readLock) {...}, synchronized (stateLock) {...}).

    However, I can consistently reproduce the fact that -UseBiasedLocking is better on nearly all lines (typical line below):
    @4285,4675,4865,5341,19046,24489,30971

    while +UseBiasedLocking is typically:
    @4454,4869,5250,5499,19050,27714,39104

    May I ask an expert like you why?

