One of the main challenges associated with online anomaly detection is not knowing anything about the data your algorithm is monitoring for anomalous behaviour. This is why we can't afford to make assumptions about the underlying data. In fact, the fewer assumptions an approach makes, the better it will perform and scale.

In the first post of this series, we introduced the concept of an anomaly and provided a high-level overview of how we're approaching the problem of anomaly detection. In the second post, we introduced the concept of model selection: the process of choosing, from a range of candidate models, the one that best fits the input data. In this final post of the series, we will discuss some of the assumptions made by anomaly detection techniques and their consequences in the context of probability distributions.

Anomaly detection has been researched extensively, leading to a myriad of approaches and solutions to the problem. The techniques proposed by industry and researchers can be grouped into categories based on the underlying approach. A great overview of these categories can be found in Chandola et al.'s anomaly detection survey, referenced at the end of this post.

One category, which relies on probability distributions and models, is statistical anomaly detection. [Anscombe and Guttman 1960] capture the premise of these techniques in this statement: *"An anomaly is an observation which is suspected of being partially or wholly irrelevant because it is not generated by the stochastic model assumed"*.

Statistical techniques can leverage two different types of distribution models: parametric or nonparametric. Density estimation involves deriving the probability distribution p(x) of a variable x from a finite set of observations (x1, ... , xN). Density estimation is an extremely difficult problem since, in principle, many different probability distributions could have generated a given data set, and a particular one may never fit the data exactly.

Parametric approaches make assumptions about the distribution of the underlying data and derive their parameters, such as the mean or standard deviation, from it. For example, a parametric approach could use a Gaussian, binomial, or multinomial distribution to model the data.

On the other hand, nonparametric approaches make no assumptions about the data distribution. Nonparametric approaches might still include parameters, but these define the complexity of the resulting model rather than the shape of the distribution. Choosing the right distribution ties in strongly with the problem of model selection, which I presented in my previous blog post.

When developing statistical anomaly detection techniques that leverage probability distributions, several assumptions are usually made which can lead to undesirable and costly consequences. Below I list and discuss three common assumptions, along with examples that demonstrate their disadvantages.

### Assumption 1: The data fits a particular probability distribution

A simple, yet popular, assumption is that the underlying data fits a normal distribution. Anomaly detection techniques that make this assumption flag all data points that are more than 3 standard deviations away from the distribution mean as anomalies. I've been asked countless times: why not simply do that? It must work!

The logic behind this is that, for a normal distribution, the interval of the mean ± 3 standard deviations contains 99.7% of the data instances. The figure below shows a histogram of 1,000 data points drawn from a normal distribution with a mean of 100 and a standard deviation of 0.4. The green lines are the mean ± 3 standard deviations thresholds. You can see that the majority of the data does indeed lie between those two lines, and data far away from the center of the distribution is "anomalous". This approach is tolerable if your data is as pretty as the blue normal data in this graph, which is rarely the case.
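The three-sigma rule is easy to sketch in a few lines. Below is a minimal illustration, assuming NumPy and synthetic data with the same mean and standard deviation as the figure:

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for the figure's data: 1,000 points from N(100, 0.4).
data = rng.normal(loc=100.0, scale=0.4, size=1000)

mu = data.mean()
sigma = data.std(ddof=1)
lower, upper = mu - 3 * sigma, mu + 3 * sigma

# Flag everything outside mean ± 3 standard deviations as "anomalous".
anomalies = data[(data < lower) | (data > upper)]
print(f"flagged {anomalies.size} of {data.size} points")
```

On data this well-behaved, roughly 0.3% of points fall outside the thresholds. The trouble starts when the data is not normal, as the next example shows.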

As a quick example let’s take a look at CPU utilization data:

You can quickly see that assuming this distribution is Gaussian, and using the mean and standard deviation bounds as thresholds, will not work, because the data does not fit a normal distribution. This in turn will lead to a large number of false positives. The green line in the graph above is the mean + 3 standard deviations derived from CPU utilization data containing no anomalous behaviour. Using this method, the smaller spikes to the right of the graph, which don't represent anomalous behaviour, will be mistakenly flagged as anomalous.

### Assumption 2: The test statistic and parameters you chose will work on all datasets

To determine that a point is anomalous you could use a test statistic. There are many different test statistics, such as Grubbs' test, Student's t-test, Cook's distance, Euclidean distance from a model, and so on. The tests themselves usually make assumptions about the data; for instance, Grubbs' test assumes that the data is generated by a Gaussian distribution.
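As an illustration, a single round of Grubbs' test can be sketched as follows. This is a minimal sketch assuming SciPy is available; the statistic and critical value follow the standard two-sided formulation, and the injected outlier value is purely illustrative:

```python
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    """One round of Grubbs' test: is the most extreme point an outlier?

    Note the test itself assumes the data is Gaussian -- the very
    kind of assumption this post is about.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    g = np.max(np.abs(x - mean)) / sd            # Grubbs' test statistic
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)  # two-sided critical t value
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return g, g_crit, g > g_crit

rng = np.random.default_rng(1)
sample = np.append(rng.normal(0.0, 1.0, 99), 8.0)  # inject one clear outlier
g, g_crit, is_outlier = grubbs_test(sample)
```

On Gaussian data with an injected spike this works well; on multimodal or heavy-tailed data the same statistic misfires, which is exactly the point.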

Choosing one to begin with is nontrivial, and extending it to datasets drawn from entirely different distributions is even more difficult. Let's take a look at an example where piecewise linear regression was used to model historical data of an arbitrary metric.

When using regression to detect anomalies, a model is fit to historical data, then the residual (the difference between the observed value and the value predicted by the model) is computed. The magnitude of the residual, via a test statistic, can be used to determine the anomaly score. Defining how large a residual must be to cross the threshold, and choosing a test that will generalise to unseen metrics, is a challenge. Moreover, the presence of anomalies in the historical data will skew the regression parameters and produce an inaccurate model, which will lead to false positives. Assumptions about the data collection and pre-processing step are therefore made indirectly.
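A residual-based score might be sketched like this. The trend data, the injected anomaly, and the threshold of 3 standardised residuals are all assumptions chosen for illustration, and a plain linear fit stands in for the piecewise model:

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.arange(200, dtype=float)
series = 0.05 * t + rng.normal(0.0, 1.0, size=200)  # trend + Gaussian noise
series[150] += 10.0                                 # inject an anomaly

# Fit a simple linear model to the history, then score the residuals.
slope, intercept = np.polyfit(t, series, deg=1)
predicted = slope * t + intercept
residuals = series - predicted

# Standardised residual as a crude anomaly score; the 3.0 threshold is
# a choice the caller must make -- exactly the difficulty described above.
scores = np.abs(residuals) / residuals.std(ddof=1)
flagged = np.flatnonzero(scores > 3.0)
```

Note that the injected anomaly itself inflates the residual standard deviation and tilts the fit, a small-scale version of the contamination problem described above.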

### Assumption 3: Nonparametric techniques are the answer

Nonparametric techniques make fewer assumptions about the data; however, they still have their own unique issues.

One of the simplest nonparametric approaches is counting-based: it uses histograms to estimate a density distribution. A histogram is built from historical or training data of size *N* by partitioning the data into bins of width *w* and then counting the number of observations *nᵢ* falling into each bin.

The estimated density is used to determine if a new set of data points belong to the distribution or not. Let’s take a look at an example of such a density estimation approach.

In the graphs below, the same data, drawn from a mixture of Gaussians, is used for all three plots; the only difference is the bin width, from the smallest in the first plot to the largest in the third. We can see the significant effect the bin size has on the probability density. In the third graph, the three distinct distributions are no longer represented and the model is overly smooth. This demonstrates the importance of estimating suitable parameters, and that even nonparametric techniques still require some knowledge of the data.

### Final Thoughts

Parametric methods perform well if the underlying data distribution matches the assumption made. However, because of that very assumption they are restrictive and cannot easily be applied to unknown data, possibly leading to a high false positive rate. Choosing a test statistic to determine whether a data point is anomalous is challenging, especially as the distributions increase in complexity.

Nonparametric techniques, on the other hand, make no assumptions about the underlying data, but are still limited in the sense that their parameters must be properly estimated from the data, or the resulting model will suffer tremendously. Essentially, what we really need are density models that make no assumptions about the data and whose complexity is not largely affected by the parameters derived from the data or by the size of the data.

*References:*

- Bishop, *Pattern Recognition and Machine Learning*, 2006
- Chandola, Varun, Arindam Banerjee, and Vipin Kumar. "Anomaly detection: A survey." ACM Computing Surveys (CSUR) 41.3 (2009): 15.