March 2, 2020

Supercharged jstack: How to Debug Your Servers at 100mph


A guide for using jstack to debug live Java production servers

jstack is like U2 – it’s been with us since the dawn of time, and we can’t seem to get rid of it. Jokes aside, jstack is by far one of the handiest tools in your arsenal for debugging a live production server. Even so, I still feel it’s deeply underused in terms of its ability to pull you out of the fire when things go bad, so I wanted to share a few ways in which you can supercharge it into an even stronger weapon in your war against production bugs.

At its core, jstack is a super easy tool that shows you the stack traces of all Java threads running within a target JVM. Just point it at a JVM process via its pid and you get a printout of all the thread stack traces at that point in time. This enables you to answer the age-old question of “what is this server doing?”, and brings you one step closer to understanding why it’s actually doing it. The biggest pro of jstack is that it’s lightweight – it doesn’t add any performance overhead to the JVM or change its execution state (unlike a debugger or profiler).


Since nothing is perfect, jstack has two significant disadvantages. The first is that jstack doesn’t provide you with any variable state beyond the call stack, which means that while you might be looking at a stack, you have no idea what state got it there. A good example would be looking at a JVM that’s hanging, where jstack would show you that a large number of threads are executing DB queries or waiting to obtain a connection.


This probably means that some queries are taking too long to execute, causing other threads to either wait for a connection or be denied one. This is one spot where you’d really like to know which query is being executed (and with which parameters) to cause the slowdown, and when it began. This is of course just one example out of a plethora of scenarios in which some threads are blocked and are reducing the throughput of your application. Unfortunately, since jstack doesn’t give you any variable state, you can’t really tell which thread is to blame.


The second disadvantage to jstack is that it’s not an always-on tool. This means that you have to be there when the problem occurs – which in production can be a rare event. This is even more true in elastic environments where VMs are constantly restarted.
Here comes the good part – let’s take a look at two techniques that can help us overcome these two shortcomings, and make a good tool really great.

Creating stateful thread data

The first question is: how can you add state to your jstack printout? The answer is simple and powerful – thread names. While many mistakenly consider a thread name to be an immutable or OS-determined property, it’s in fact a mutable and incredibly important trait of each thread. It’s also the one that gets picked up in your jstack stream, and therein lies the key.


The practical application is that, much like logging, you should take control of the thread name as soon as execution enters your code through an entry point such as a servlet, actor, or scheduler. At that point you’ll want to set its name to a meaningful value that captures the execution context and the relevant parameters, so you can isolate the transaction and its contents.


This would most likely include –
1. The purpose of the thread (e.g. processing a message, responding to a user request, etc.).
2. The transaction ID, which enables you to identify this specific data flow across different machines and parts of the application.
3. Parameter values, such as servlet parameters or the ID of a message being dequeued.
4. The time at which you obtained control of the thread. This last item is critically important for knowing exactly which threads in your code are stuck when you use jstack to observe them.


[java]
Thread.currentThread().setName(context + ", TID: " + transactionId
        + ", Params: " + params + ", Start Time: " + startTime);
[/java]
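
To make this concrete, here’s a minimal sketch (not from the original code) of one way to wrap a unit of work so the worker thread carries a descriptive name while the transaction is in flight. The class and parameter names are placeholders; the important parts are building the name from the context, transaction ID, parameters and start time, and restoring the original name afterwards so pooled threads don’t keep stale names.

[java]
import java.time.Instant;

public class NamedTransactionRunner {

    // Wraps a unit of work so the current thread carries a descriptive name
    // (context, transaction ID, parameters, start time) while the work runs,
    // then restores the original name; this matters for pooled threads.
    public static void runNamed(String context, String transactionId,
                                String params, Runnable work) {
        Thread current = Thread.currentThread();
        String originalName = current.getName();
        current.setName(context + ", TID: " + transactionId
                + ", Params: " + params + ", Start Time: " + Instant.now());
        try {
            work.run();
        } finally {
            current.setName(originalName);
        }
    }
}
[/java]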


This data will mean the difference between looking at a printout such as the one below, which doesn’t actually tell us anything about what a thread is doing or why, and one which is informative:

"pool-1-thread-1" #17 prio=5 os_prio=31 tid=0x00007f9d620c9800 nid=0x6d03 in Object.wait() [0x000000013ebcc000]

Compare this with the following thread printout –

"Queue Processing Thread, MessageID: AB5CAD, type: AnalyzeGraph, queue: ACTIVE_PROD, Transaction_ID: 5678956, Start Time: 10/8/2014 18:34"
#17 prio=5 os_prio=31 tid=0x00007f9d620c9800 nid=0x6d03 in Object.wait() [0x000000013ebcc000]

What you’re seeing here is a much fuller explanation of what this thread is actually doing. You can easily see that it’s dequeuing messages from an AWS queue, which message it’s analyzing, its type and ID, and the transaction ID. And last, but far from least – when the thread started working on it. This can help you focus very quickly on the threads that are stuck and see the state they’re in. From there, optimizing and reproducing the issue locally becomes a much easier job.


The alternative here would be to hope that there’s data in the log files, and that you can correlate data in the logs to this exact thread. Another option would be to attach a debugger in production, either locally or remotely. Neither is very pleasant, and both are time consuming.


Writing this information into the thread name also helps with traditional logging. Even though most logging frameworks provide thread-based context that can be added to the log, you have to make sure you configure it correctly. Using the thread name also ensures you will have all the data you need in the log.
Note: Some folks may say that thread names are not to be tampered with or changed. I put very little stock in this, based on both my own experience of doing so for years and that of many colleagues.

Making jstack always-on

The second challenge we face when using jstack is that, just like a debugger, it’s a tool you have to operate manually at the moment the problem is happening in order to capture the corrupt state. However, there’s a more active way of using jstack: automatically generating printouts when a server hangs, or when its throughput falls below or rises above a certain threshold. The key is to invoke jstack programmatically, just as you would any logging facility from within the JVM, whenever specific application conditions are met.
The two key challenges here are when and how you do it.

How to activate jstack programmatically?

As jstack is a plain OS process, invoking it is fairly straightforward. All you have to do is launch the jstack process and point it at your own JVM. The kicker is how to get the pid for your process from within the JVM. There’s actually no standard Java API to do it (at least not until Java 9). Here’s a little snippet that gets the job done (albeit not part of a documented API):


[java]
import java.lang.management.ManagementFactory;

// On HotSpot JVMs, RuntimeMXBean.getName() conventionally returns "pid@hostname".
final String PID_SEPARATOR = "@";
String mxName = ManagementFactory.getRuntimeMXBean().getName();
int index = mxName.indexOf(PID_SEPARATOR);
String pid;
if (index != -1) {
    pid = mxName.substring(0, index);
} else {
    throw new IllegalStateException("Could not acquire pid using " + mxName);
}
[/java]
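
If you’re running on Java 9 or later, there’s no need for this workaround; the standard ProcessHandle API exposes the pid directly:

[java]
// Java 9+ only: the supported way to obtain the current process id.
long pid = ProcessHandle.current().pid();
[/java]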


Another minor challenge is directing jstack output into your log. That’s also fairly easy to set up using output stream gobblers. Look here for an example of how to direct output printed by a process you invoke into your log file or output stream.
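
As a rough illustration (this is not the linked example, just a sketch of the same idea), here’s how you could launch jstack against your own pid and gobble its output into a log file. The log file path is an arbitrary choice, and jstack is assumed to be on the PATH:

[java]
import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.lang.management.ManagementFactory;

public class JstackLogger {

    // Runs "jstack <pid>" against the current JVM and appends its output to a log file.
    public static void dumpThreadsToLog(String logFile) throws IOException, InterruptedException {
        String pid = ManagementFactory.getRuntimeMXBean().getName().split("@")[0];
        Process jstack = new ProcessBuilder("jstack", pid)
                .redirectErrorStream(true) // merge stderr into stdout
                .start();

        // Simple stream gobbler: read the process output line by line and write it to the log.
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(jstack.getInputStream()));
             PrintWriter log = new PrintWriter(new FileWriter(logFile, true))) {
            String line;
            while ((line = reader.readLine()) != null) {
                log.println(line);
            }
        }
        jstack.waitFor();
    }
}
[/java]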


While it’s possible to capture the stack traces of running threads internally using Thread.getAllStackTraces(), I prefer to do it by running jstack, for a number of reasons. The first is that this is something I would usually want to happen externally to the running application (even if the JVM is participating in providing the information), to make sure I don’t impact the stability of the application by making introspective calls. Another reason is that jstack is more powerful in terms of its capabilities, such as showing you native frames and lock state, something that’s not available from within the JVM.
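
For reference, here’s a hedged sketch of that in-JVM alternative; it shows Java frames only, without the native frames and lock details you’d get from jstack:

[java]
import java.util.Map;

public class InProcessDump {

    // In-JVM thread dump via Thread.getAllStackTraces(): Java frames only,
    // no native frames or monitor/lock information.
    public static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            sb.append('"').append(entry.getKey().getName()).append('"').append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
[/java]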

When do you activate jstack?

The second decision you need to make is under which conditions you’ll want the JVM to log a jstack. This would probably be done after a warm-up period, whenever the server falls below or rises above a specific processing threshold (i.e. requests or messages processed per interval). You may also want to make sure enough time passes between activations, so you don’t flood your logs under low or high load.
The pattern you would use here is to load up a watchdog thread from within the JVM which periodically looks at the throughput state of the application (e.g. the number of messages processed in the last two minutes) and decides whether a “screenshot” of thread state would be helpful, in which case it activates jstack and logs the output to a file.


Set the name of this thread to contain the target and actual throughput state, so when you do take an automatic jstack snapshot you can see exactly why the watchdog thread decided to do so. As this would only happen every few minutes, there’s no real performance overhead to the process – especially compared to the quality of data provided.


Below is a snippet showing this pattern in action. The startScheduleTask method loads up a watchdog thread that periodically checks a throughput value, which is incremented using a Java 8 concurrent adder (LongAdder) whenever a message is processed.


[java]
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
private final LongAdder adder = new LongAdder(); // bumped whenever a message is processed

public void startScheduleTask() {
    scheduler.scheduleAtFixedRate(new Runnable() {
        public void run() {
            checkThroughput();
        }
    }, APP_WARMUP, POLLING_CYCLE, TimeUnit.SECONDS);
}

private void checkThroughput() {
    int throughput = adder.intValue(); // the adder is inc'd when a message is processed
    if (throughput < MIN_THROUGHPUT) {
        Thread.currentThread().setName("Throughput jstack thread: " + throughput);
        System.err.println("Minimal throughput failed: executing jstack");
        executeJstack(); // see the code on GitHub to see how this is done
    }
    adder.reset();
}
[/java]
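
For context, the adder in the snippet above would be incremented on the message processing path, along these lines (the handler and message types are placeholders):

[java]
// Hypothetical message handler: record each processed message so the
// watchdog can compare recent throughput against MIN_THROUGHPUT.
public void onMessage(Message message) {
    processMessage(message); // your actual business logic
    adder.increment();
}
[/java]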


The full source code for preemptively invoking jstack from within your code can be found here.
