Applying data science to monitoring
Lately, in collaboration with Karan Shah, I’ve been focusing most of my efforts on operational monitoring. We need to know when bad things are happening, are about to happen, or have been happening for a long time.
Monitoring is a good example of a problem that’s easier to state than it is to solve. In practice a lot goes wrong:
- Sometimes we fail to monitor things that we care about.
- Sometimes we monitor things we care about, but we route the alerts to the wrong audience.
- Sometimes we monitor things we care about, and we route alerts to the right audience, but we send too much to that same audience. Alerts can be too fine-grained or too repetitive. We can generate too many false positives. We might mix in too many alerts that we don’t care about. And so forth. The signal-to-noise ratio drops in a way that undermines the monitoring system as a whole.
Some of the causes are organizational in nature. Many flowers bloom at Expedia, and monitoring isn’t an exception. So we see a lot of ad hoc approaches to monitoring and a somewhat fragmented tools landscape. Also, we have a strong self-service culture, and so when we set up a shared, self-service tool, sometimes a tragedy of the commons ensues.
But there are some important technical causes too, and that’s what I want to talk about today. In the rest of this post I’ll share our work here, which I’d describe generally as applying data science (a.k.a. statistics) to monitoring.
As a preliminary, let’s look at how a typical anomaly detection model works.
The point series estimates the “midpoint”, or expected value, of the metric at the time in question, and the band reflects the estimated dispersion of the values. If the observed value goes outside the band then we generate an alert.
Note that the time series here have periodicity, or seasonality as it’s often called. A given time series can and often does have multiple seasonalities. For example, any Expedia time series driven by human traffic will generally have daily, weekly and annual seasonalities.
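Concretely, the alert check itself is just a band test. Here's a minimal sketch, with illustrative names; the midpoint and band width come from the models described below:

```python
# Minimal sketch of band-based anomaly alerting. `midpoint` is the model's
# expected value at this time, and `band_width` its estimated dispersion;
# both names are illustrative.

def check_alert(observed, midpoint, band_width):
    """Alert when the observed value falls outside the band."""
    return abs(observed - midpoint) > band_width
```

For example, an observation of 130 against a midpoint of 100 with a band of +/- 15 would alert; an observation of 105 would not.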
We’re going to ignore the boring case where there’s some static level beyond which we want warning or error alerts, since that kind of monitoring is trivial. An example would be alerting when your available disk space goes below 10 GB or whatever. There are plenty of cases where static thresholds make sense, but they aren’t blog-worthy.
Anybody who receives monitoring alerts is familiar with the problem of false positives. There are different ways this can happen, but there are two that I see so often that I’ve come to expect them:
- Crummy point series estimates due to tiny samples
- Crummy bands due to neglecting heteroscedasticity
The technical approach for establishing good point estimates is different from the one for establishing good bands, so we'll treat those separately. Let's do the point estimates first.
For time series with multiple seasonalities, the typical approach to establishing the point series is to capture a sample of the last n week-over-week values, where n is relatively small, and then use their mean as the estimate. For example, if the current time is 2016-05-28 1:05am, then on this approach we create a sample containing the corresponding values (i.e., 1:05am) on 2016-05-21, 2016-05-14, and 2016-05-07 (n = 3).
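A minimal sketch of this approach, assuming the series is available as a dict keyed by timestamp (the helper names are illustrative):

```python
from datetime import datetime, timedelta

def week_over_week_sample(series, current_time, n=3):
    """Collect the values at the same time of day on the previous n weeks.
    `series` is assumed to be a dict mapping datetime -> observed value."""
    return [series[current_time - timedelta(weeks=k)] for k in range(1, n + 1)]

def point_estimate(series, current_time, n=3):
    """Estimate the expected value as the mean of the week-over-week sample."""
    sample = week_over_week_sample(series, current_time, n)
    return sum(sample) / len(sample)
```

With observed values of 100, 110 and 90 at 1:05am on the three prior Saturdays, the estimate for 2016-05-28 1:05am is their mean, 100.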
The major challenge with this approach is that it’s sensitive to outliers in the historical sample, primarily because the sample is too small. In essence, when there is an outlier in the historical sample, it shows up in the current expectation:
There are some techniques to attempt to deal with this, such as increasing the number of weeks, increasing the number of points per week by capturing a weekly window instead of a single point, using the median instead of the mean (median is more robust against outliers), smoothing the time series, or some combination thereof.
But they don’t work very well. We can’t go back too far in time: what happened in February may have limited relevance for what to expect in July. For a similar reason there are limits to how wide a window we can use in a given week. Anyway, widening the window around an outlier often results in capturing a bunch of additional outliers, which doesn’t really help anything.
The good news is there’s an approach that allows us to increase the sample size in a way that avoids the problems above. It’s called seasonal-trend decomposition using loess, or STL. The intuition is this: we should be able to look at the last (say) four weeks of data in their entirety, even if there are some outliers mixed in, and get a good sense for what the current expectation should be.
At a high level, the algorithm takes a time series and decomposes it into three components: seasonal, trend and remainder. We generally take the seasonal component to be the highest-frequency seasonality; the trend absorbs lower-frequency seasonalities along with general up/down movement. The remainder series is everything that's left over once you've isolated the seasonal and trend components. Here's an example for four weeks of bookings data:
The top panel is the original time series, the second is the daily pattern, the third is the weekly/general pattern, and the last is the remainder. We can further isolate the weekly seasonality from the general trend by simply rerunning STL on the trend component in the third panel.
We establish the point series essentially by discarding the remainder, which captures the statistical noise and outliers. This allows us to build a midpoint model, based on the entire four week sample, that’s robust in the face of outliers.
With the point series in place, we need a way to establish the band around the point series.
The most common approach is to define thresholds as some fixed percentage of the midpoint. For example, we might say that the upper and lower bounds will be +/- 25% of the midpoint.
Once again the default approach doesn't work. The problem has a fancy-sounding name, heteroscedasticity, which just means that the dispersion (or variability) in the data isn't constant across the data values. For bookings, what happens is that the dispersion of the counts varies with the level, and specifically, grows with it. That's heteroscedasticity. Here's a scatterplot (courtesy of Danny Ng) illustrating this relationship, including overlays for IQR, MAD and standard deviation:
Moreover, though in absolute terms the residual grows with the level, in relative terms (relative to the level), it gets smaller. Here’s a plot that shows this phenomenon:
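Both facts fall out of simple arithmetic if we assume Poisson-like counts (the update below mentions an inhomogeneous Poisson process): a Poisson count at level λ has standard deviation √λ, so the absolute dispersion grows with the level while the relative dispersion 1/√λ shrinks:

```python
import math

# For Poisson-like counts at level lam, the standard deviation is sqrt(lam).
for lam in (100, 10_000, 1_000_000):
    sd = math.sqrt(lam)      # absolute dispersion: grows with the level
    rel = sd / lam           # relative dispersion: shrinks with the level
    print(lam, sd, rel)
```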
That’s why fixed percentages don’t work. Here are a couple of charts that show what happens when you try to use fixed percentages:
The takeaway is that fixed percentages are too static. At low volumes we need relatively wider bands (percentage-wise) to avoid false positives, and at high volumes we need relatively tighter bands (again percentage-wise) to avoid false negatives. I've written about this elsewhere if you're interested in more details.
Our current approach to band modeling is to fit simple nonlinear models to ratio scatterplots of the sort given above. We do that offline using nonlinear regression. There’s more work to do here around model selection; we selected the current model based on qualitative similarity to the dataset, and it would be nice to have a little theory behind it. Or at least to evaluate a range of candidate models.
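As a sketch of what such a fit looks like, here's SciPy's `curve_fit` applied to a hypothetical model form, relative band width ≈ a/√level + b (consistent with Poisson-like counts); both the model form and the ratio data here are illustrative, not the production model:

```python
import numpy as np
from scipy.optimize import curve_fit

def band_model(level, a, b):
    # Hypothetical model form: relative band width shrinks as 1/sqrt(level).
    return a / np.sqrt(level) + b

# Illustrative (level, relative band width) pairs, not real bookings data.
levels = np.array([100.0, 400.0, 1600.0, 6400.0])
rel_widths = np.array([0.30, 0.16, 0.09, 0.055])

(a, b), _ = curve_fit(band_model, levels, rel_widths)
```

At monitoring time, the band at a given midpoint level is then just `band_model(level, a, b) * level` on either side of the point estimate.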
Update: Big thanks to Danny Ng, who offered a useful simplification to the current band model, an improved visualization, and a theoretical motivation for the model. (TIL: inhomogeneous Poisson process.)
The preceding gives some idea as to the core approach we’re taking toward anomaly detection modeling where seasonality is involved. In the next section I’ll describe current work on building this out to handle monitoring at scale.
I noted above that we build the point series by running STL and throwing away the remainder component. The naive way to do this is to rerun STL on four weeks of data with every monitoring poll (e.g., every five minutes). Keep in mind that we actually run STL twice because we want to isolate not only the daily but the weekly seasonality as well. Now imagine that we want to monitor many thousands of time series. (This happens for bookings because we want to monitor various combinations of product, location, and time, each at different levels of granularity. Similarly for funnel metrics and anything else: there's a combinatorial explosion of conditions we want to monitor.) Between all the data fetching and the decomposition itself, this quickly becomes an expensive proposition.
Not all of this processing is necessary, though. The point series is based on seasonal components, and these evolve very slowly. For any given time series we want to monitor, we should be able to run STL to extract the daily and weekly components and simply cache those for some period (maybe a day).
Caching solves the time problem, but now we need to pay attention to space. We don’t want to cache the seasonality models as full-blown time series, since this gets expensive fast. Instead we want to cache compact representations of the models. There are a couple of techniques we can use here:
- For any given seasonal series, whether daily or weekly, we need to cache only a single cycle. That's because the seasonal component evolves slowly.
- A seasonal cycle is a smooth curve. So we should be able to use curve fitting as a way to reduce (quite dramatically) the size of the model.
After some investigation, curve fitting by B-splines appears to be a promising approach. The idea is to partition the time series into a number of segments, curve fit the individual segments, and tie them together to achieve a smooth result with a good fit to the data. The points where we tie the segments together are called knots. This dramatically reduces the size of a given midpoint model, facilitating caching.
Here are some examples showing B-spline fits to a daily cycle. First, here’s an awful fit based on a single knot:
The only reason I’m showing that is to offer visibility into the curve fitting process. If we have too few knots, then the models won’t be good.
Here’s another example, based on 11 knots:
The fit is very good, and we need to store only 14 model parameters instead of the 288 data points we would have stored under the naive approach. The space savings for weekly series are even more extreme: about two orders of magnitude.
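Here's a sketch of this kind of fit using SciPy's `splrep`/`splev` with 11 uniform interior knots; the daily cycle here is synthetic stand-in data:

```python
import numpy as np
from scipy.interpolate import splrep, splev

x = np.linspace(0, 1, 288)            # one daily cycle at 5-minute resolution
daily = np.sin(2 * np.pi * x)         # synthetic stand-in for a seasonal cycle

knots = np.linspace(0, 1, 13)[1:-1]   # 11 uniform interior knots
tck = splrep(x, daily, t=knots, k=3)  # cubic B-spline least-squares fit

fitted = splev(x, tck)
# The compact model is tck (knot vector + coefficients): far fewer numbers
# than the 288 points in the original cycle.
```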
Note that we’re using uniform knot placement here, but the more typical approach is to place knots at quantiles. We plan to do it that way, and that will probably reduce the number of knots and hence parameters required.
The B-spline modeling is work in progress. I’ll post a follow-up once we’ve firmed up and proven the approach.