Monday 1 July 2013

A Java Ping Buffet

Buffet pings
When considering latency/response time in the context of client/server interactions I find it useful to measure the baseline, or no-op, round trip between them. This gives me a real starting point for appreciating the latency 'budget' available to me. This blog post presents an open source project offering several flavours of this baseline measurement for server connectivity latency.

How did it come about?

I put the first variation together a while back to help review baseline assumptions on TCP response time for Java applications. I figured it would be helpful to have a baseline latency number for the network/implementation stack, from the application and back. This number may seem meaningless to some, as the application does nothing much, but it serves as a boundary, a ballpark figure. If you ever took a course on networking and the layers between your application and the actual wire (the network stack), you can appreciate that this measurement covers a round trip from the application layer and back.
This utility turned out to be quite useful (both to me and a few colleagues) over the past few months. I tinkered with it some more, added another flavour of ping, and another. And there you have it, a whole bloody buffet of them. The project now implements the same baseline measurement for:
  • TCP
    • Busy spin on non-blocking sockets
    • Selector busy spin on selectNow
    • Selector blocking on select
    • Blocking sockets
    • Old IO sockets are not covered (maybe later)
  • UDP
    • Busy spin on non-blocking sockets
    • All the other cases covered for TCP are not covered (maybe later)
  • IPC via memory mapped file
This code is not entirely uniform and I beg your forgiveness (and welcome your criticism) if it offends your sensibilities. The aim was simplicity, with little enough code that it needs little in the way of explanation. All it does is ping, and measure. All measurements are in nanoseconds (Thanks Ruslan for pointing out the omission).
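To make the 'ping, and measure' bit concrete, here is a minimal sketch of what each flavour boils down to. This is illustrative only, not the project's actual code; the host, port and message size are made up. The client sends a small message, waits for the server to echo it back, and records the round trip with System.nanoTime():

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.util.Arrays;

// Illustrative sketch: a blocking TCP ping client. The project's flavours differ
// in how they wait for the echo (busy-spin, selectNow, select, blocking), but the
// measurement is always nanoTime() around a request/response round trip.
public class PingSketch {
    public static void main(String[] args) throws IOException {
        SocketChannel channel = SocketChannel.open(new InetSocketAddress("localhost", 12345));
        ByteBuffer buffer = ByteBuffer.allocateDirect(8);
        long[] rtts = new long[100_000];
        for (int i = 0; i < rtts.length; i++) {
            long start = System.nanoTime();
            buffer.clear();
            buffer.putLong(start).flip();
            while (buffer.hasRemaining()) channel.write(buffer);  // ping
            buffer.clear();
            while (buffer.hasRemaining()) channel.read(buffer);   // wait for the echo
            rtts[i] = System.nanoTime() - start;                  // round trip, in nanoseconds
        }
        Arrays.sort(rtts);
        System.out.println("Median RTT (ns): " + rtts[rtts.length / 2]);
    }
}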
The original TCP spinning client/server code was taken from one of Peter Lawrey's examples, but it has been mutilated plenty since, so it's not really his fault if you don't like it. I also had great feedback and even some code contribution from Darach Ennis. Many thanks to both.
My mother has a wicked backhand

Taking the code for a drive

Imagine you've got some kit you want to measure baseline Java network performance on. The reality of these things is that the performance is going to vary with JVM/OS/hardware and with the tuning of any and all of those ingredients. So off you go: build java-ping.zip (ant dist), copy it onto your server(s) of choice and unzip it (unzip java-ping.zip -d moose). You'll find the zip is fairly barebones and contains some scripts and a jar. You'll need to make the scripts runnable (chmod a+x *.sh). Now, assuming you have Java installed, you can start the server:
$ ./tcp-server.sh spin &
And then the client:
$ ./tcp-client.sh spin 

And in lovely CSV format the stats will pour into your terminal, making you look busy.
Min,50%,90%,99%,99.9%,99.99%,Max
6210,6788,9937,20080,23499,2189710,46046305
6259,6803,7464,8571,10662,17259,85020
6275,6825,7445,8520,10381,16981,36716
6274,6785,7378,8539,10396,16322,19694
6209,6752,7336,8458,10381,16966,55930
6272,6765,7309,8521,10391,15288,6156039
6216,6775,7382,8520,10385,15466,108835
6260,6756,7266,8508,10456,17953,63773
Using the above as a metric you can fiddle with any and all of the variables available to you and compare before/after and this/that configurations.
In the previous post on this utility I covered the variance you can observe between taskset and roaming processes, so I won't bore you with it again. All the results below were acquired while using taskset. For IPC you'll get better results when pinning to the same core (different threads), but a worse tail. For TCP/UDP the best results I observed were across different cores on the same socket. If you are running across 2 machines then ignore the above and pin as makes sense to you (on NUMA hardware the NIC can be aligned to a particular socket, have fun).
The tool allows further tweaking of whether or not it will yield when busy-spinning (-Dyield=true) and adding a wait between pings (-DwaitNanos=1000). These are provided to give you a flavour of what can happen as you relax the hot loops into something closer to a back-off strategy, and as you let the client/server 'drift in their attention'.
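For illustration only, this is roughly how I'd expect those two knobs to relax the loop; the property names match the flags above, but the exact wiring inside the project may differ:

import java.util.concurrent.locks.LockSupport;

// Sketch of the two relaxation knobs, not the project's exact code.
public class SpinPolicy {
    static final boolean YIELD = Boolean.getBoolean("yield");      // -Dyield=true
    static final long WAIT_NANOS = Long.getLong("waitNanos", 0L);  // -DwaitNanos=1000

    // Called each time round the hot loop while waiting for a response.
    static void onSpinWait() {
        if (YIELD) {
            Thread.yield(); // give up the time slice instead of burning the core
        }
        // otherwise: pure busy-spin, lowest latency, 100% CPU on that core
    }

    // Called between pings, letting client and server 'drift in their attention'.
    static void betweenPings() {
        if (WAIT_NANOS > 0) {
            LockSupport.parkNanos(WAIT_NANOS);
        }
    }
}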

Observing the results for the different flavours

The keen observer will notice that average latency is not reported. Average latency is not latency. Average latency is just a time unit divided by throughput. If you have a latency SLA you should know that. An average is a completely inappropriate tool for measuring latency. Take for example the case where half your requests take 0.0001 millis and half take 99.9999 millis: how is the average latency of 50 millis useful to you? Gil Tene has a long presentation on the topic which is worth a watch if the above argument is completely foreign to you.
The results are reported as a range of percentiles; it's easy enough to add further analysis, as all the observed latencies are recorded (all numbers are in nanoseconds). I considered using a histogram implementation (like the one in the Disruptor, or HdrHistogram) but decided it was better to stick to the raw data for something this small and focused. This way no precision is lost, at the cost of a slightly larger memory footprint. This is not necessarily appropriate for every use case.
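As a rough sketch (not the project's actual reporting code), producing a row like the ones above from the raw observations is just a sort and a handful of index lookups:

import java.util.Arrays;

// Sketch: derive "Min,50%,90%,99%,99.9%,99.99%,Max" from raw nanosecond samples.
public class PercentileRow {
    static String summarise(long[] observedNanos) {
        long[] sorted = observedNanos.clone();
        Arrays.sort(sorted);
        StringBuilder row = new StringBuilder().append(sorted[0]); // Min
        for (double p : new double[]{0.5, 0.9, 0.99, 0.999, 0.9999}) {
            int index = Math.min(sorted.length - 1, (int) (p * sorted.length));
            row.append(',').append(sorted[index]);
        }
        return row.append(',').append(sorted[sorted.length - 1]).toString(); // Max
    }
}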
Having said all that, here is a sample of the results for running the code on semi-respectable hardware (all runs are pinned using taskset, all on default settings, all numbers are in nanoseconds):
Implementation, Min,   50%,   90%,   99%,   99.9%, 99.99%,Max
IPC busy-spin,  89,    127,   168,   3326,  6501,  11555, 25131
UDP busy-spin,  4597,  5224,  5391,  5958,  8466,  10918, 18396
TCP busy-spin,  6244,  6784,  7475,  8697,  11070, 16791, 27265
TCP select-now, 8858,  9617,  9845,  12173, 13845, 19417, 26171
TCP block,      10696, 13103, 13299, 14428, 15629, 20373, 32149
TCP select,     13425, 15426, 15743, 18035, 20719, 24793, 37877

Bear in mind that this is RTT (Round Trip Time), so a request-response timing. The above measurements are also over loopback, so no actual network hop. The picture for 2 machines hooked into each other via a network cable will be similar; anything beyond that and your actual network stack will become more and more significant. Nothing can cure geography ;-)
I am sure there are further tweaks to make in the stack to improve the results. Maybe the code, maybe the OS tuning, maybe the JVM version. It doesn't matter. The point is you can take this and measure your stack. The numbers may differ, but the relative performance should be fairly similar.

Is it lunch time?

This is a bit of a detour, but bear with me. On the IPC side of things we should also start asking ourselves: what is the System.nanoTime() measurement error? What sort of accuracy can we expect?
I added an ErrPingClient which runs the test loop with no actual ping logic, the result:
Min, 50%, 90%, 99%, 99.9%, 99.99%,Max
38,  50,  55,  56,  59,    80,    8919

Is this due to JVM hiccups? Inherent inaccuracy of the underlying measurement method used by the JVM? At this sort of time scale the latency measurement becomes a problem unto itself, and we have to resort to counting on (horrors!) average latency over a set of measurements to cancel out the inaccuracy. To quote the Hitchhiker's Guide: "Time is an illusion, and lunch time doubly so"; we are not going to get exact timings at this resolution, so we will need to deal with error. Dealing with this error is not something the code does for you, just be aware that some error is to be expected.
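If you want to see this floor for yourself, a sketch in the spirit of the ErrPingClient (not its actual source) is just back-to-back calls to nanoTime():

import java.util.Arrays;

// Sketch: time an 'empty ping' to expose the cost/error of the measurement itself.
public class NanoTimeErrorSketch {
    public static void main(String[] args) {
        long[] deltas = new long[1_000_000];
        for (int i = 0; i < deltas.length; i++) {
            long start = System.nanoTime();
            // no ping logic at all
            deltas[i] = System.nanoTime() - start;
        }
        Arrays.sort(deltas);
        System.out.println("Min: " + deltas[0]
                + ", Median: " + deltas[deltas.length / 2]
                + ", Max: " + deltas[deltas.length - 1]);
    }
}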

What is it good for?

My aim with this tool (if you can call it that) was to uncover the baseline costs of network operations on a particular setup. This is a handy figure to have when judging the overhead introduced by a framework/API. No framework in the world could beat a bare-bones implementation using the same ingredients, but knowing the difference educates our shopping decisions. For example, if your 'budget' for response time is low, the overhead introduced by the framework of your choice might not be acceptable. If the overhead is very high, perhaps there is a bug in the framework or in how you use it.
As the tool is deployable, you can also run it to validate the setup/configuration and use that data to help expose issues independent of your software.
Finally, it is a good tool to help people who have grown to expect Java server application response times to be in the tens-of-milliseconds range wake up and smell the scorching speed of today's hardware :-)

11 comments:

  1. Thanks for your blog post. A quick scan of the source suggests that you're not actually using ICMP packets. Is that correct?

    Replies
    1. This falls under: http://en.wikipedia.org/wiki/Ping_(networking_utility)#Other_types_of_pinging

  2. Correct. This is an application-level ping, not the ping application/protocol. Fair point.

  3. ...Could you print units of time with your results?

    Replies
    1. Damn. A rookie mistake there. It's all nanoseconds, I'll update the post to mention it. Would you want the utility to print it?

  4. 1. Why is the server socket closed so early in the code instead of in a finally {} block?
    2. Would it be useful to also provide a benchmark for connection latency, TCP blocking vs. TCP NIO busy-spin?

    Replies
    1. Glad you are taking interest, but it's been a while since I looked at the code... so:
      1. I can't remember why. Do you think it's a bug/issue? If so, please use the GitHub issue tracking mechanism.
      2. It might be useful. If you want this to happen I'm happy to hear more, but be aware that it may be a long while before I get to it. If you have implemented it and want to send me a pull request, please do.
      Thanks :-)

  5. I am really wondering what can cause an 11 millisecond swing when I ran this on a very powerful Linux server.

    As you can see, the average time measured on an empty ping can swing from 40 nanoseconds to 20 microseconds, so the TCP latency testing in our 10 GbE environment is meaningless, since it's well under 20 microseconds with the onload driver.

    _____________________________________________
    java-ping # taskset -c 9 ./err-client.sh
    Min,50%,90%,99%,99.9%,99.99%,Max
    @36,40,40,132,174,186,4754628
    @37,40,40,40,101,105,11009008
    @39,50,64,65,74,83,22554
    @39,50,64,65,74,99,28717
    @39,50,64,65,74,97,34797
    @40,50,64,65,68,77,26473
    @40,50,64,65,68,87,22941
    @40,50,64,65,68,75,34179
    @39,50,64,65,72,92,7754
    @39,50,64,65,73,88,10797
    @39,50,64,65,73,85,17153
    @39,50,64,65,73,83,11590
    @38,50,64,65,73,100,6792
    @40,50,64,65,68,86,507535
    @40,50,64,65,68,96,14464
    @40,50,64,65,68,89,7980
    @39,50,64,65,72,86,11057
    @38,50,64,65,73,93,26564
    @39,50,64,65,73,102,35056
    @38,50,64,65,73,83,26706
    @39,50,64,65,73,95,35276
    @39,50,64,65,69,86,28051
    @40,50,64,65,68,75,35067
    @40,50,64,65,68,86,34854
    @40,50,64,65,68,85,36261
    @39,50,64,65,74,100,6885
    @38,50,64,65,74,122,13323
    @39,50,64,65,73,135,8328
    @39,50,64,65,74,83,7784
    @39,50,64,65,73,93,7405

    Replies
    1. I don't read the data provided as particularly problematic:
      - "the average time measure on empty ping can swing from 40 nano seconds to 20 microseconds" - The output is a percentile breakdown of the latencies observed. There's is no average figure here, so not sure what you mean by this comment.
      - "what can cause 11 milliseconds swing" - This is observed in the 2 first measurement iterations. I imagine this is a JVM warmup artifact, and a normal one at that. To verify this theory you need to log JVM stopped events (-XX:+PrintGCApplicationStoppedTime).
      The data you observe is not unusual. Up to 99.99% of calls to nanoTime terminate in 40ns to 135ns. See observations on this topic from Aleksey Shipilev: https://shipilev.net/blog/2014/nanotrusting-nanotime/#_one_off_measurements
      The max you observe (7 to 507us) is in the noise level expected on a server. The noise can be reduced by configuration. You can read Mark Price posts on the topic here:
      http://epickrram.blogspot.co.za/2015/11/reducing-system-jitter-part-2.html

  6. I see all of the implementations use NIO. I compared NIO DatagramChannel vs DatagramSocket and found DatagramSocket to be slightly better on all percentiles, especially regarding outliers.
    DatagramSocket max RTT was 2.7 msec compared to 26 msec on DatagramChannel.
    Have you tested a DatagramSocket implementation?

  7. Great tool!
    I tried it together with +/-UseBiasedLocking. I used the TCP server/client spin flavour, which has no contention between threads, so I expected +UseBiasedLocking to show some improvement, as the JDK socket writing and reading involves lots of synchronized blocks (e.g. synchronized (readLock) {...}, synchronized (stateLock) {...}).

    However, I can consistently reproduce the fact that -UseBiasedLocking is better on nearly all lines (typical line below):
    @4285,4675,4865,5341,19046,24489,30971

    while +UseBiasedLocking is typically:
    @4454,4869,5250,5499,19050,27714,39104

    May I ask an expert like you why?

