### What Ails Application Performance Monitoring: A Data Scientist’s Perspective

APM tools have become popular over the past decade, but how reliable are they for spotting anomalies and alerting?

There are two key reasons to use an Application Performance Monitoring (APM) solution:

- **Collect and visualize data for your app.** This data could be response times, error rates, database calls (you get the idea). Almost all APM tools can collect and visualize with minimal configuration and effort. And they all support powerful visualization tools for quick troubleshooting.
- **Get alerted when something bad happens.** Your engineers define what is bad for your application, and you set up rules to trigger an alert when this happens.

So what ails APM today? If you said rule-based alerting, bingo! The number one reason rule-based alerting performs poorly is that the rules you define are static, whereas your application data is dynamic. Unfortunately, static rules are simple to set up, so they have become the go-to method for the engineering community. The consequence: ineffective monitoring, with a barrage of false-positive alerts that makes finding a real outage like finding a needle in a haystack.

First, let’s look at the nature of the data collected and the two most popular rules in use today.

# Data

We are dealing with standard univariate time-series, where one data point (y) is reported per time step (t).

As an example, let’s say you want to understand the application response times for user login. The APM agent (usually running within your app) collects the response times for all user logins and reports the average of the values at each minute. So you get one average response time value (y) per minute (t) and you repeat this each minute.

The series thus generated has 3 fundamental components that are applicable to any time series:

**Trend**(T) – General change in the level of data (up, down, sideways) over a longer period. Below is a chart of the JVM heap utilization that clearly shows an increasing trend.

**Seasonality**(S) – A pattern of fixed period influenced by seasonal factors (time of day, day of the week, month of year, etc). The chart below depicts the number of software deployments captured in our own Harness platform. You can clearly see the dips in the evenings and nights and over the weekends. The peaks are in the middle of the week when companies actively deploy software.

**Randomness** (R) – Random deviations observed from the underlying patterns.

Any time series can thus be modeled as a function of the Trend (T), Seasonality (S) and Randomness (R). An additive model, for instance, looks like this:

$Y_t = T_t + S_t + R_t$
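To make the additive model concrete, here is a minimal sketch that builds a synthetic per-minute series from the three components. The function name `synthetic_series`, the starting level, the slope, the daily period, and the noise level are all assumptions chosen for illustration; none of them come from real APM data.

```python
import math
import random


def synthetic_series(n_minutes, seed=0):
    """Build y[t] = trend[t] + seasonal[t] + noise[t], one point per minute.

    All constants below are illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)
    period = 1440  # minutes per day: assumed daily seasonality

    # Trend (T): a slow, steady increase in the level of the data.
    trend = [100.0 + 0.01 * t for t in range(n_minutes)]
    # Seasonality (S): a pattern that repeats with a fixed period.
    seasonal = [20.0 * math.sin(2 * math.pi * t / period) for t in range(n_minutes)]
    # Randomness (R): deviations from the underlying patterns.
    noise = [rng.gauss(0.0, 5.0) for _ in range(n_minutes)]

    y = [trend[t] + seasonal[t] + noise[t] for t in range(n_minutes)]
    return y, trend, seasonal, noise


# Two days of per-minute data.
y, T, S, R = synthetic_series(2 * 1440)
```

Each observed value is the sum of the three components at that minute, so removing any two of them from `y` leaves the third.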

Depending on the application, different models target different components. For instance, trend is a good indicator to track population growth. Now let’s move on to the two most common rules used in alerting.

# Trend Based Alerting

Identifying the trend involves removing seasonality and randomness in the underlying time series. The time series definition becomes:

$Y_t = T_t$

The trend (T) at any time step (t) is usually modeled as a function of the values observed at previous time steps, which gives a predicted value $\hat{Y}_t$:

$\hat{Y}_t = f(Y_{t-1}, Y_{t-2}, \ldots)$

In the context of APM alerting, function *f* is typically some form of moving average. Some examples:

- Simple moving average (SMA)
- Exponential moving average (EMA), also known as the exponentially weighted moving average (EWMA)
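As a rough illustration of these moving averages, the sketch below predicts each point from the points before it. The function names, the `window` size, and the smoothing factor `alpha` are assumptions made for this example; real APM tools may compute their averages differently.

```python
def simple_moving_average(values, window):
    """Predict each point as the mean of the previous `window` observations.

    Returns None for the first `window` steps, where there is not enough history.
    """
    preds = []
    for t in range(len(values)):
        if t < window:
            preds.append(None)
        else:
            preds.append(sum(values[t - window:t]) / window)
    return preds


def exponential_moving_average(values, alpha):
    """Predict each point from an exponentially weighted level of past points.

    `alpha` (0 < alpha <= 1) controls how quickly older observations are forgotten.
    """
    if not values:
        return []
    preds = [None]      # no history before the first observation
    level = values[0]
    for t in range(1, len(values)):
        preds.append(level)                             # prediction uses data up to t-1
        level = alpha * values[t] + (1 - alpha) * level  # then fold in the new value
    return preds


print(simple_moving_average([1, 2, 3, 4, 5], window=2))
# [None, None, 1.5, 2.5, 3.5]
print(exponential_moving_average([1, 2, 3, 4, 5], alpha=0.5))
# [None, 1, 1.5, 2.25, 3.125]
```

Note how both predictions trail the rising input: any average of past values lags behind a series that is moving.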

The alert is then raised by comparing $\hat{Y}_t$ (the predicted value) with the actual observed value $Y_t$. For instance, it could be a simple alert rule:

$Y_t > 2 \cdot \hat{Y}_t$

The problem is that almost every single observation will deviate from the trend based on seasonality or randomness. Here is a chart depicting the actual load on our Harness SaaS platform. The blue line is the actual measured load over a month and the orange line is a simple average over a 5-day moving window.

Here is a zoomed-in version showing how the moving average lags behind the actual measured value due to the seasonality and randomness.

Here is a scatter plot of $Y_t - \hat{Y}_t$, the difference between the actual and predicted values. Note that every single point here is eligible for an alert, which makes your alerts unreliable.
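The effect can be reproduced with a toy example. The sketch below uses a made-up hourly load with a perfectly regular daily cycle (300 calls/min during business hours, 20 otherwise) and a 24-hour trailing average with a "more than twice the predicted value" rule. The load shape, the window, and the function names are all assumptions for illustration, and nothing anomalous ever happens in this series.

```python
def hourly_load(hours):
    """Hypothetical load: 300 calls/min from 09:00 to 17:00, 20 calls/min otherwise."""
    return [300 if 9 <= t % 24 < 17 else 20 for t in range(hours)]


def trend_alerts(values, window, factor=2.0):
    """Return the time steps where the observed value exceeds
    `factor` times the trailing simple moving average."""
    alerts = []
    for t in range(window, len(values)):
        predicted = sum(values[t - window:t]) / window
        if values[t] > factor * predicted:
            alerts.append(t)
    return alerts


load = hourly_load(7 * 24)             # one week of hourly data
alerts = trend_alerts(load, window=24)
print(len(alerts))                      # 48
```

The trailing average settles at about 113 calls/min, so the rule fires on all 8 business hours of each of the 6 days that have a full window of history. Every one of those 48 alerts is caused by ordinary seasonality.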

For APM, we need to model the seasonality and randomness for reliable alerts and not the trend, which is in complete contrast to what the trend-based strategy does.

# Standard Deviation-based Alerting

Standard deviation-based alerting models metrics such as errors and response times as random variables. The modeling rests on the assumption that the averages of sufficiently large samples of independent observations (with finite variance) are approximately normally distributed, whatever the underlying distribution looks like (see the central limit theorem).
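A small simulation can illustrate the central limit theorem assumption. The sketch below draws samples from a heavily skewed exponential distribution, standing in here for something like response times, and looks at how the sample averages are spread. The distribution, the sample sizes, and the function name are assumptions for this example only.

```python
import random
import statistics


def sample_means(n_samples, sample_size, seed=0):
    """Average `sample_size` draws from a skewed (exponential) distribution,
    repeated `n_samples` times."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


means = sample_means(n_samples=2000, sample_size=50)
center = statistics.fmean(means)
spread = statistics.pstdev(means)

# For a normal distribution, roughly 68% of values lie within one standard deviation.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)
print(round(center, 2), round(spread, 2), round(within_one_sd, 2))
```

Although the individual draws are far from normal, the averages cluster symmetrically around 1.0 with roughly the share of values within one standard deviation that a normal distribution would give. The catch is the sample size: with only a handful of draws per average, the approximation is much worse.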

In English, this means that if you are measuring error rates from all service instances, the average of the error rates across the service instances might be normally distributed. Generally, this might be the case if you have a large infrastructure of service instances, high load volume, and a large enough window of observations from which to sample. Even then, it has two major challenges when it comes to alerting:

1. The errors in your application or transactions are the most reliable metric for alerting. Looking at our own data and that of our customers, we see that error rates are not normally distributed. The main reason is that they are sparse: a period of 0 errors and then spikes, as seen in the chart below.

Errors per minute

With the distribution-based approach, you are alerted for every spike, because your mean is close to 0. This is the same as saying “alert me anytime errors > 0”. Often, the spike is sporadic and not actionable. A better approach for alerting is to look for a sustained pattern that is different from what was observed in the past.

2. The distribution approach eliminates the time dimension. It samples observations over a time period to construct the baseline normal distribution. Typically, an alert is triggered when the observation is greater than three standard deviations from the mean of the baseline normal distribution.

The selected time period should be large enough to obtain enough samples for the distribution to converge to normal. It might be a week’s worth of observations, a month, or longer in the case of sparse transactions.
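To see why sparse error data defeats the three-standard-deviation rule described in the first point, consider the sketch below. The error series is invented for illustration (zero errors almost all the time, with a burst of 5 errors once every 100 minutes), and for simplicity the same series is used both as the baseline and as the data being checked. The function name is also an assumption.

```python
import statistics


def three_sigma_alerts(baseline, observed):
    """Flag observations more than three standard deviations above the baseline mean."""
    mean = statistics.fmean(baseline)
    sd = statistics.pstdev(baseline)
    threshold = mean + 3 * sd
    return threshold, [t for t, x in enumerate(observed) if x > threshold]


# Hypothetical errors per minute: mostly 0, with a spike of 5 every 100 minutes.
errors = [5 if t % 100 == 50 else 0 for t in range(1000)]

threshold, alerts = three_sigma_alerts(errors, errors)
print(round(threshold, 2))  # 1.54
print(len(alerts))          # 10
```

The mean is 0.05 errors per minute and the threshold lands at about 1.54, so every one of the 10 sporadic spikes raises an alert. In practice the rule reduces to "alert whenever there are 2 or more errors in a minute".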

Below is the distribution of load for the entire Harness SaaS platform over a 1-hour period. Clearly, there are not enough samples for the distribution to look normal.

Distribution of Calls/min (1hr window)

The distribution of load over 3 days comes closer to being normally distributed (see chart below). Note that when you go down to the transaction level, your time window needs to be much longer, at least a few months:

Distribution of Calls/min (3-day window)

Like the trend-based approach above, the distribution approach fails to capture seasonality and randomness. It also discounts any patterns, since the alert is driven by a single observation. Let’s look at a concrete example.

The chart above captures the load on an application we monitor over a 1-hour window during an outage. The green circle depicts normal load for this time of the day, and the red circle shows the load dropping, with the drop persisting for a long while. At 8:15 am, the application finally spikes and crashes. The drop here was actually indicative of an ongoing problem which resulted in the crash later.

You can see the distribution of load for this application over the 3 days preceding the crash. The distribution is centered around a mean of 10,500 calls/min, with the one-standard-deviation range running from 6,500 calls/min to 14,500 calls/min, as depicted by the blue box plot (just over the X-axis).

With the standard deviation strategy, alerts are typically triggered when the observed values are beyond three standard deviations, which is *calls/min > 20,000 or calls/min < 2,500*.

Let’s overlay the load values observed in the red circle as ‘**X**’s on top of the baseline distribution above.

It is clear that all the ‘**X**’s are within one standard deviation of the mean. The standard deviation approach clearly misses the onset of this outage and would only trigger on a full crash. The longer your baseline time window, the more prevalent are the misses (false negatives). The shorter the time window, the more prevalent are the false positive alerts (not enough data points to obtain a realistic distribution).
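The miss can be checked with a quick z-score calculation. The mean below is the one quoted above, and the standard deviation is taken as half the width of the quoted one-standard-deviation range. The degraded load values are hypothetical stand-ins for the ‘X’s, since the actual readings are only visible in the chart.

```python
BASELINE_MEAN = 10_500.0  # calls/min, as quoted in the text
BASELINE_SD = 4_000.0     # calls/min, half the width of the 6,500 to 14,500 range


def z_score(value, mean=BASELINE_MEAN, sd=BASELINE_SD):
    """Number of standard deviations between `value` and the baseline mean."""
    return (value - mean) / sd


# Hypothetical readings during the sustained drop (not taken from the chart).
degraded_load = [7_200, 7_000, 6_900, 7_100]

scores = [z_score(x) for x in degraded_load]
flagged = [x for x, z in zip(degraded_load, scores) if abs(z) > 3]

print([round(z, 3) for z in scores])  # [-0.825, -0.875, -0.9, -0.85]
print(flagged)                         # []
```

Every reading sits less than one standard deviation below the mean, so a three-standard-deviation rule stays silent for the whole duration of the drop.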

# Conclusion

At Harness, we feed all of our monitoring signals from APM providers, databases, internal metric trackers and so on into our own 24/7 Service Guard solution. This uses powerful anomaly detection techniques (no rules) to generate alerts on the underlying signals. Based on the signal, the engineer jumps to the underlying tools and dashboards for further troubleshooting. The benefit is effective monitoring and better sleep :). Whatever tools you use, beware the perils of using rule-based alerting.

Cheers!

Sriram.
