Psychosomatic, Lobotomy, Saw: perf-map-agent

FlameGraphs are superawesome. If you've never heard of FlameGraphs and want to dive straight in at the deep end, you should run off and check out the many, many good resources provided by Brendan Gregg in his one-stop-shop page here. This post will give a quick intro and some samples to get you started with collecting profiles for all JVMs everywhere. I'm taking a slightly different tack than Brendan in presenting the topic, so if it turns out my explanations suck you should see if his make more sense.

What's the big deal?

If you've ever used a profiler to look at your code you will typically have seen 2 kinds of profile report:

Flat profile: This is often presented as the "top X" methods/classes/packages where time (or samples, or ticks or whatever) is spent. This is useful as it immediately shows up common bottlenecks across your code, but these are shown out of context. Sometimes this is enough, but often in larger application profiles context is significant. This representation is very useful when a method with a high overall impact is called from many callsites, making each callsite cheap but the method itself significant.
Tree profile: This profile will present you with a call tree where each method is a node with a total and a self time quantity. The self measure is the amount of time spent in the method itself (the number of samples in which the method is the leaf), and total is the total number of samples in which it shows up (as leaf or as intermediate node).
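To make the self/total distinction concrete, here's a minimal sketch which computes both numbers per method from a handful of made-up samples. The function name flat_profile and the sample stacks are my own invention for illustration, this is not how any particular profiler does it:

```python
from collections import Counter

def flat_profile(samples):
    """samples: iterable of (frames, count), frames ordered root first, leaf last."""
    self_counts, total_counts = Counter(), Counter()
    for frames, count in samples:
        # Self: only the leaf frame was actually on-CPU for this sample.
        self_counts[frames[-1]] += count
        # Total: every frame on the stack, counted once even if it recurses.
        for frame in set(frames):
            total_counts[frame] += count
    return self_counts, total_counts

# Hypothetical samples: "work" is called from two call sites.
samples = [
    (("main", "a", "work"), 3),
    (("main", "b", "work"), 2),
    (("main", "b"), 1),
]
self_counts, total_counts = flat_profile(samples)
for name, total in total_counts.most_common():
    print(name, "total:", total, "self:", self_counts[name])
```

Note how "work" only shows its full weight once the two call sites are added up, which is exactly the case where the flat view earns its keep.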

The problem with the tree view is that it is very unpleasant to navigate. Click click click click and as the stack deepens it becomes harder to look at and ingest. Enter FlameGraphs.

FlameGraph represents a tree profile in a single interactive SVG where:

The x-axis shows the stack profile population, sorted alphabetically (it is not the passage of time), and the y-axis shows stack depth. Each rectangle represents a stack frame. The wider a frame is, the more often it was present in the stacks. The top edge shows what is on-CPU, and beneath it is its ancestry.

Like most visualisations, it makes more sense when you see it than when it is explained. Let's start with data sets we can easily grasp and see what they look like.

Synthetic Samples For Starters

For instance, what does a single stack sample look like? The FlameGraphs SVG generating script takes as its input a "collapsed stacks" file which has a dead simple format: frames separated by semicolons, followed by the number of times this stack was sampled. Here's a dummy handwritten example of a single sample file (call it single-stack.cstk):
main;0;1;2;3;4;5;6;7;8;9;10 1
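If you'd rather poke at this format from code, here's a small sketch of a parser for one collapsed line. The name parse_collapsed is mine, and I'm assuming the count is whatever follows the last space on the line (frame names can contain spaces, so splitting on the first space would be a mistake):

```python
def parse_collapsed(line):
    """Split one collapsed-stack line into (frames, count)."""
    # Split on the LAST space: everything before it is the stack.
    stack, _, count = line.strip().rpartition(" ")
    return stack.split(";"), int(count)

frames, count = parse_collapsed("main;0;1;2;3;4;5;6;7;8;9;10 1")
print(len(frames), "frames, root:", frames[0], "leaf:", frames[-1], "samples:", count)
```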

We can feed this to the flames (now is a good time to clone this repo and try shit out):
flamegraph.pl single-stack.cstk > single-stack.svg

Here's what a single stack trace looks like:

But a single stack trace is just one data point, not a profile. What if we had 1M samples of this same stack?

Well... it would look pretty much the same, but if you hover over it, it will tell you it got 1M samples. It looks the same because we still have 100% the same stack for the whole profile. It's the same profile.
"BUT!" I hear you say, "But, colours?". Yes, the colours mean nothing at this point, but will become interesting later. The default colour palette is red and the choice of colours is random, hence the colour selection changes from run to run. Just forget colours for a second, OK?
Right, next we want to look at a set of samples with a few more stacks:
main;0;1;2;3;4;5 1
main;0;1;2;3;4;5;6 2
main;0;1;2;3;4;5;6;7 3
main;0;1;2;3;4;5;6;7;8 4
main;0;1;2;3;4;5;6;7;8;9 5

Now you can also get a feel for what clicking around does and how you zoom in and out.
By now I hope you get the picture for how a bunch of stacks and frequencies look with simple data sets. The last synthetic example to look at has several root frames and slightly more varied stacks. Let's try this:
main;a;1;2;3;4;5 1
main;c;1;2;3;4;5;6;7 4
main;a;1;2;3;4;5;6 2
main;c;1;2;3;4;5;6 4
main;c;1;2;3;4;5;6;8 4
main;b;1;2;3;4;5;6;7 3
main;b;1;2;3;4;5;6;8 3
main;b;1;2;3;4;5;6;9 3
main;d;1;2;3;4;5;6;7;8;9 5

And we get this:

We see here that stacks are sorted alphabetically and ordered from left to right. The ordering has nothing to do with the order in the file. The collapsed stacks format is itself an aggregation with no view on timing. So the order from left to right is only about merging, not time or anything else. We can see that stacks which share a common parent naturally aggregate under that parent. The width of each frame is its relative total-time share. Its self-time share is its top exposure, or how much of it is not covered by its callees, the frames on top of it.
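The merging described above can be sketched in a few lines. This is my own toy version, not the code in flamegraph.pl, and the names build_tree, self_samples and layout are made up. It merges stacks into a tree, lays out siblings alphabetically, and derives width (total share) and top exposure (self samples) for a frame:

```python
def build_tree(collapsed_lines):
    """Merge collapsed stacks into a tree of {"total": n, "children": {...}}."""
    root = {"total": 0, "children": {}}
    for line in collapsed_lines:
        stack, _, count = line.rpartition(" ")
        n = int(count)
        node = root
        node["total"] += n
        for frame in stack.split(";"):
            node = node["children"].setdefault(frame, {"total": 0, "children": {}})
            node["total"] += n
    return root

def self_samples(node):
    """Top exposure: samples not covered by any callee."""
    return node["total"] - sum(c["total"] for c in node["children"].values())

def layout(node, grand_total):
    """Children left to right: alphabetical, with width as a share of all samples."""
    return [(name, child["total"] / grand_total)
            for name, child in sorted(node["children"].items())]

lines = [
    "main;a;1;2;3;4;5 1", "main;c;1;2;3;4;5;6;7 4", "main;a;1;2;3;4;5;6 2",
    "main;c;1;2;3;4;5;6 4", "main;c;1;2;3;4;5;6;8 4", "main;b;1;2;3;4;5;6;7 3",
    "main;b;1;2;3;4;5;6;8 3", "main;b;1;2;3;4;5;6;9 3", "main;d;1;2;3;4;5;6;7;8;9 5",
]
root = build_tree(lines)
main = root["children"]["main"]
for name, width in layout(main, root["total"]):
    print(name, f"{width:.0%}")
```

Shuffle the input lines and the layout stays the same, which is the point: left to right is about merging, not time.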

Tree View vs Flames

Now that we've got the hang of this flamy thing, let's take a look at the same profile using 2 presentations: the venerated tree-view and this new hipsterish whatever flame thing. The following is a profile collected using honest-profiler for a netty benchmark:

In a typical workflow I step further and further into the hot stack, but this pushes the big picture out of my view. I would now have to go back up and fold it to see what hides under other hot branches in the tree. It's a familiar and annoying experience if you've ever used a profiler with this kind of view. The problem is that Java class and method names are typically long, and stacks are quite deep. This is a simple application and I quickly run out of room.
Here's the FlameGraph for the same profile (I chose green, because later it makes sense):

NOTE: I made all the flame graphs in this post narrow so they fit the layout. They don't have to be this narrow. You can set the width to whatever you like, I used "--width=700" for the graphs in this post.
We can see the root frames quickly break out to main activities, with the deep netty stack now visible upfront. We can click and zoom easily. I also find the search ability which colors matching strings useful to highlight class/package participation in the profile. Prominent flat-tops indicate hot leaf methods we might want to look at.
It's perhaps a matter of taste, but I love it. I've been using flame graphs for a while and they are pure genius IMO. I find the image itself is intuitively more approachable, and with the ability to quickly zoom in/out and search I can usually quickly work through a profile without losing sight of the full picture.
So how do you get one?

It's Bob! yay?

Everybody Gets A FlameGraph!

Yes, even you poor suckers running JDK 1.3 on Windows XP! I don't recommend this method of profiling if you have a more recent JVM, or if your JVM supports AsyncGetCallTrace, but if your deployment is stuck in the past you can still be in on this. This is because ALL JVMs must support JVMTI and AFAIK allow you to hit them with jstack/JVisualVM/hprof. It's a terrible way to profile, there's a lot of overhead, and usually you can find a better way, but this is universally available. Collecting a sample via jstack is (a terrible idea, but) quite easy. Just find the pid of the process you want to profile using jps and then do something like:
for i in {1..100}; do
  jstack <pid> >> iloveoldshit.jstk
  sleep 0.1
done

And Bob is your relative (which is a good thing apparently).

Once you've collected a large enough sample for your application you can go on and feed flame graphs:

cat iloveoldshit.jstk | ./stackcollapse-jstack.pl | ./flamegraph.pl --color=green > jstack-flames.svg

And you get:

This is the same benchmark as before, but a different profile, one which suffers from safepoint bias. You can compare the two by scrolling up and down. OR you can use FlameGraphs to diff the 2, in a moment.
FlameGraphs supports converting jstack output into collapsed stacks (as above). Efforts exist on GitHub to convert the hprof format which JVisualVM produces (as well as other profilers) into collapsed stack format.
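To demystify the jstack conversion step, here's a rough sketch of what collapsing a thread dump involves. This is NOT a port of stackcollapse-jstack.pl: it ignores thread names and thread states, the regex is my guess at a good-enough match for "at ..." lines, and the dump below is a handwritten fake:

```python
import re
from collections import Counter

# Matches lines like "\tat app.Server.handle(Server.java:42)" and keeps the method name.
FRAME = re.compile(r"^\s+at (\S+?)\(")

def collapse_jstack(text):
    """Count identical stacks across all thread dumps in `text`."""
    counts = Counter()
    frames = []
    for line in text.splitlines() + [""]:   # trailing "" flushes the last thread
        match = FRAME.match(line)
        if match:
            frames.append(match.group(1))
        elif not line.strip() and frames:
            # jstack prints the leaf first; collapsed stacks want the root first.
            counts[";".join(reversed(frames))] += 1
            frames = []
    return counts

# Handwritten, hypothetical dump: one busy thread and one frameless VM thread.
dump = '''"worker" #12 prio=5 tid=0x1 nid=0x2 runnable
   java.lang.Thread.State: RUNNABLE
\tat app.Codec.encode(Codec.java:10)
\tat app.Server.handle(Server.java:42)
\t- locked <0x0001> (a java.lang.Object)
\tat java.lang.Thread.run(Thread.java:745)

"GC task thread#0" prio=10 tid=0x3 nid=0x4 runnable

'''
collapsed = collapse_jstack(dump * 2)   # pretend we sampled twice
for stack, count in collapsed.items():
    print(stack, count)
```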
So for all my poor readers, struggling under the oppression of JVMs for which better means to spill out stack samples do not exist, I got your backs, you too can be looking at flame graphs!
But seriously, just move on.

Level Up: AsyncGetCallTrace or JFR

Now, if you want a better profiler, one which does not bring your application to a safepoint and pause ALL your threads at each sample, AND you are running a 1.6 or higher JDK (OpenJDK/Oracle/recent Zing) on Linux/Mac/BSD, you can use Honest-Profiler to collect your profile. If you have Oracle JDK 1.7u40 or higher you can use Java Flight Recorder (if you use it in production you need to pay for the licence). These profilers rely on AsyncGetCallTrace to record the Java stack safely on interrupt (from a signal handler, not at a safepoint).
To collect with Honest-Profiler I start my JVM with the following parameter:
-agentpath:$HONEST_PROFILER_HOME/liblagent.so=host=localhost,port=4242,logPath=$PWD/netty.hpl
Then, when I feel the time is right, I can start and stop the profile collection
echo start | nc localhost 4242
echo stop | nc localhost 4242
To convert the binary format into collapsed stacks I need to use a helper class in the honest-profiler.jar:
java -cp $HONEST_PROFILER_HOME/honest-profiler.jar com.insightfullogic.honest_profiler.ports.console.FlameGraphDumperApplication netty.hpl netty.cstk

I can then feed the flamegraph script the collapsed stacks file and get the result which we've already seen.
To convert JFR recordings to flame graphs see this post. But remember children, you must pay Oracle if you use it in production, or Uncle Larry might come for you.

Bonus: Diff Profiles!

A nice benefit of having 2 different profilers produce (via some massaging) a unified format for flame graphs is that we can now diff profiles from 2 profilers. Not something that is generally useful, granted. But diffing profiles is an established capability in many profilers, and is usually used to do a before/after comparison. Flame Graphs support this via the same visualization. Once you have converted your profiles into the collapsed stacks format you can produce a diff file and graph it:

./difffolded.pl -n A.cstk B.cstk | ./flamegraph.pl > A-B-diff.svg

Diffing the Honest-Profiler and jstack profiles gives us the following:

The white squares are not interesting. The red/pink squares highlight the delta of self samples as a percentage of total samples (not so intuitive). I admit it may seem a tad confusing at first, but at least it draws your eyes to the right places. More on differential flame graphs here.
Note: to make this diff work I had to shave off the thread names from the jstack collected collapsed stacks file.
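For the curious, the diff step boils down to lining up the two collapsed profiles stack by stack and emitting both counts on one line. Here's a sketch of that idea, including the thread name shaving I mention in the note. It skips the normalisation which -n asks for, the function names are mine, and I'm assuming the thread name sits in the first frame of the jstack collapsed stacks:

```python
from collections import Counter

def load(lines, drop_root=False):
    """Read collapsed stacks into a Counter, optionally shaving off the first frame."""
    counts = Counter()
    for line in lines:
        stack, _, count = line.strip().rpartition(" ")
        if drop_root and ";" in stack:
            stack = stack.split(";", 1)[1]   # drop the thread name frame
        counts[stack] += int(count)
    return counts

def diff_folded(a, b):
    """One line per stack seen in either profile: 'stack countA countB'."""
    return [f"{stack} {a[stack]} {b[stack]}" for stack in sorted(a.keys() | b.keys())]

# Hypothetical inputs: A has thread names as root frames, B does not.
a = load(["t1;main;x 2", "t2;main;x 1", "t1;main;y 1"], drop_root=True)
b = load(["main;x 1", "main;z 4"])
diff_lines = diff_folded(a, b)
print("\n".join(diff_lines))
```

Without the shaving, "t1;main;x" and "main;x" never line up and the diff is all noise.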

Further Bonus: Icicle Graph

Sometimes the bottleneck is not a particular call stack plateau, but rather a particular method being called from many call sites. This kind of bottleneck will not show well in a flame graph as the different stacks with similar tops will be split and may not stand out. This is really where a flat profile is great, but we can also flip the flame graph view to highlight the top method merging:

cat netty.cstk | ./flamegraph.pl --reverse --invert --color=green > netty-icicles.svg

I've filtered the stacks to show only a relevant portion:

This is not dissimilar to functionality offered by other profiler GUIs which allow the drill down direction to start from hot methods and into their callers in a tree view presentation.
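The trick behind this view is simple enough to sketch: reverse each stack so the leaf becomes the root, then merge as usual. The code below is my own illustration of that idea with made-up stacks, not what flamegraph.pl does internally:

```python
from collections import Counter

def reverse_stacks(collapsed_lines):
    """Flip every stack so the hot leaf method becomes the root frame."""
    merged = Counter()
    for line in collapsed_lines:
        stack, _, count = line.rpartition(" ")
        merged[";".join(reversed(stack.split(";")))] += int(count)
    return merged

def root_totals(stacks):
    """Samples per root frame, i.e. per leaf method after reversal."""
    totals = Counter()
    for stack, count in stacks.items():
        totals[stack.split(";", 1)[0]] += count
    return totals

# Hypothetical profile: "copy" is cheap per call site but hot overall.
lines = ["main;a;copy 2", "main;b;copy 3", "main;c;parse;copy 1", "main;c;parse 2"]
reversed_stacks = reverse_stacks(lines)
totals = root_totals(reversed_stacks)
print(totals.most_common())
```

In the regular view "copy" is split into three narrow tops; reversed, it becomes one wide base with its callers fanning out from it.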

Level UP++: Java Perf Flame Graphs FTW!

If you are so fortunate as to:

Be running OpenJDK/Oracle 1.8u60 or later (this functionality is coming to Zing in a near future release, fingers crossed)
Be running on Linux
Have the permissions to make this run

You can get amazing visibility into your system by using a combination of:

Java with: -XX:+PreserveFramePointer (also recommended: -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints, so unfolded frames are more accurate)
perf-map-agent: This attachable agent dumps a symbol mapping file for all runtime generated code in the JVM, enabling perf to correctly resolve addresses to methods. You'll need to clone and build.
perf: You'll need permissions and you'll need to install it. I assume you are root on your own machine for simplicity.
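The symbol mapping file is nothing fancy. As far as I know perf expects one line per code blob with a hex start address, a hex size and a symbol name, and looks for the file at /tmp/perf-<pid>.map. Here's a sketch of what resolving an address against such a file looks like. The addresses and symbol names are made up and the function names are mine:

```python
import bisect

def load_perf_map(lines):
    """Parse 'START SIZE symbol' lines (hex, no 0x prefix) into sorted entries."""
    entries = []
    for line in lines:
        start, size, name = line.rstrip("\n").split(" ", 2)   # names may contain spaces
        entries.append((int(start, 16), int(size, 16), name))
    entries.sort()
    return [e[0] for e in entries], entries

def resolve(starts, entries, addr):
    """Return the symbol covering addr, or None if no mapped blob contains it."""
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0:
        start, size, name = entries[i]
        if addr < start + size:
            return name
    return None

# Hypothetical map contents.
starts, entries = load_perf_map([
    "7f40 100 Interpreter",
    "7f00 40 Ljava/lang/String;::hashCode",
])
print(resolve(starts, entries, 0x7F10))
print(resolve(starts, entries, 0x9000))
```

An address that falls outside every mapped blob is what ends up as an unknown frame in the profile.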

With the above working you can collect a perf profile of your Java process (by itself or as part of the whole system). This results in a perf.data file and a perf-<pid>.map file in your /tmp folder. You can then proceed to generate a collapsed stack profile from that file. The simplest way to get this going is by using a script packed with perf-map-agent:

perf-java-flames <pid>

This will ask for password as it needs to sudo a few things. Be the sudo you want to see in the world. After a suspenseful wait of 15 seconds you'll get this:

Note: Netty deploys its native lib into tmp, loads it and deletes it, which means perf gets lost looking for it. I deleted it from the benchmarks jar and loaded it directly using LD_LIBRARY_PATH to resolve this visibility issue. It doesn't make a huge difference, but it's worth knowing in case you try this out.

The green frames are Java, and the rest are all the magic which happens to help Java along. Here's what happened:

Red frames are C library or Kernel code. We can see that the socket writes in the original profile actually go into native land. We now have further visibility down the hole. Importantly this illustrates where hot Java methods are in fact not Java methods at all, and so looking at their Java code for optimisation opportunities is futile.
Yellow frames are C++. We can see the call chain leading into the interpreter and ultimately into compiled Java code.
Green frames are Java. BUT if you compare the previously presented profiles with this one you will notice there are some intermediate frames missing here. This is because the frames in this profile are "real" frames, or rather they map to actual stack frames. Inlined methods in Java do not have their own stack frames, so we can't see them (for now, we'll sort this out in a second). Furthermore, the keen observer will notice the familiar "Thread.run" bottom of the stack is missing, replaced by the "interpreter". As is often the case, the run method did not get compiled in this benchmark so it is not a proper compiled method for which we have a mapping. Methods in the interpreter are opaque in this profile.
Some stacks are broken, which can be confusing. In the example above we can see the 13.8 unknown chunk which leads to some JVM activities, but also to some Java code. More on that later.

So, it would seem that we have gained some visibility into the native/OS/JVM CPU utilization, but lost a lot of information we had on the Java side. When is this still useful?

This profile is super useful if you are writing Netty and trying to work out which system calls you end up with from your JNI code, or where time is spent in that code (netty implements its own native epoll selector, very cool). If you are writing an application which utilizes JNI libraries this profile will give you visibility across the divide. The alternative here would be to use 2 profilers and try and correlate them. Solaris Studio also offers some help here, I will one day write a post on Solaris Studio.
This is not a good example of a profile dominated by JVM threads, but in many profiles the GC activity will show up. This is very useful, as GC and compiler CPU utilization can get in the way of application threads using the available CPU. A Java only profiler leaves you to correlate GC/compilation logs and application profile to figure out who ate the pie. It's also an interesting view into which part of the GC is to blame.
Some JVM intrinsics are confusing to AsyncGetCallTrace, and invisible to safepoint profilers. The biggest culprit I see is array copy. Array copies will show up as failed samples on AGCT profilers (unless, like JMC, they just fail to tell you about failed samples altogether). They show up in this profile (search above for arraycopy), but only a tiny bit.
This profile can be collected system wide, allowing you to present a much wider picture and expose machine wide issues. This is important when you are looking at machine level analysis of your application to improve configuration/setup.
In depth view of OS calls can inform your configuration.

'look at me! look at me now!' said the cat.

'with a cup and a cake on the top of my hat!

I can hold up TWO books!

I can hold up the fish!

and a little toy ship!

and some milk on a dish!

and look!

I can hop up and down on the ball!

but that is not all!

oh, no. That is not all...

Bonus: Inlined Frames! Threads! COLOR!

We can win back the inlined frames information by asking perf-map-agent to create a more detailed map file with inlining data. This leads to larger map files, but should be worth it.
You can further tweak the command line to color kernel frames differently and control sample duration and frequency. And while we're a-tweakin' let's also have threads info.
Here's what you run:

PERF_COLLAPSE_OPTS="--kernel --tid" PERF_RECORD_FREQ=99 PERF_RECORD_SECONDS=10 PERF_MAP_OPTIONS=unfoldall perf-java-flames <pid>

And the graph now looks like this:

The Java frames are now green and aqua. Aqua frames are inlined frames and green are "real". This information is not presented at all by most profilers, and is pretty inaccessible in others. Here we can instantly see some interesting inlining challenges in the tall towers of same sized frames. The compiler inlines through many, but maybe eventually gives up, maybe there's something to be won by simplifying the abstraction here?
Thread id is added as a base frame. This is helpful in this particular example because there are only 2 interesting threads and I very much want to see this split. It also helps bring back some broken stacks into the fold. Now I can tell these frames belong to the Netty epoll thread. Yay.
Orange frames are kernel frames.
Having the thread id highlights that the non-Java frames on the left are from a variety of threads. If we had more GC/Compiler work happening this may become interesting.
When profiling a large application with thread pools, this separation by thread may not be what you want... but sometimes it is very helpful, like above. In this benchmark I have a thread generating load and a thread I want to profile, so telling them apart works great. At the moment there's no mapping of threads to thread names, but in future we may be able to easily group thread pools for more meaningful views.
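With the thread id as the base frame, asking "which thread ate the CPU?" becomes a matter of summing samples per first frame of the collapsed stacks. A quick sketch of that, where the thread labels and stacks are invented for the example and the function name is mine:

```python
from collections import Counter

def samples_per_thread(collapsed_lines):
    """Sum sample counts by the base (first) frame of each collapsed stack."""
    per_thread = Counter()
    for line in collapsed_lines:
        stack, _, count = line.rpartition(" ")
        per_thread[stack.split(";", 1)[0]] += int(count)
    return per_thread

# Hypothetical collapsed stacks with a thread label as the base frame.
lines = [
    "java-1250;epollWait;sys_epoll_wait 40",
    "java-1250;channelRead;encode 25",
    "java-1251;generateLoad;write 30",
    "java-1243;GCTaskThread::run 5",
]
per_thread = samples_per_thread(lines)
grand_total = sum(per_thread.values())
for thread, count in per_thread.most_common():
    print(thread, f"{count / grand_total:.0%}")
```

The same loop with a mapping from thread id to pool name in the middle would give you the thread pool grouping I'm wishing for above.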

Bonus: Hack It All Up

There's very little code in perf-map-agent, and the scripts would take you 5 minutes to read through. You don't have to use the scripts, you can write your own. You can add or enhance features, it's easy to participate or customize. Dig in, have fun :-)

The FlameGraph scripts are nice and tidy, and the pipeline separation of [profile -> collapsed stacks -> graph] means you can read through them and tweak as you like the bits you care about without caring too much about the rest. While working on this post I played with the diff presentation a bit. It was my first ever interaction with Perl, and I'm not so very bright, and I managed to get what I wanted. Surely someone as fine as yourself can do better.

If you look at Brendan's updates page you'll see many, many people are jumping in and tweaking and sharing and making funky things. Go for it!

Summary And Credits

So you can have flame graphs, all of you. And you can feed these with inputs from several sources, each with their own set of pros and cons. It makes a great tool in your tool box and may give you that extra perspective you are missing in your profiling.

There are a few people who deserve mention in the context of the tools above. Look them up, they all helped make it happen:

Brendan Gregg: FlameGraphs' proud daddy. Brendan has written a lot about FlameGraphs, work through his posts and you'll learn plenty.
Johannes Rudolph: Author and maintainer of perf-map-agent.
Jake Luciani: Contributed flamegraphs support for inlined frames.
Richard Warburton: Author and maintainer of honest-profiler.

Thanks for reading :-) next up some JVM profile analysis

Tuesday, 14 February 2017

Java Flame Graphs Introduction: Fire For Everyone!