In my previous blog post I introduced the concept of an anomaly and provided a high level overview of how we’re approaching the problem of anomaly detection. I mentioned how we use historical data to build models which we use to determine if a new data point is anomalous or not. In this blog post, I’m going to introduce the concept of model selection which is the process of choosing a model from a range of candidate models that best fits the input data.
Classification models assign a variable or input vector to one of a finite number of classes. A classification model is trained on a labelled dataset that includes examples from each class. The model learns to distinguish between them and you end up with a trained classifier. In our case we don’t have the luxury of a user provided labeled dataset or even the definition of what constitutes as an anomalous class.
This makes online anomaly detection challenging because in a sense it’s a one-class classification problem. Only one class is represented in our historical data and our model predicts whether a data point belongs to that distribution. Basically, we are building a model to describe normal behaviour based on what happened in the past, and we assume everything else is bad.
A model is used to describe the data generated and represents all assumptions we supply about the data. The assumptions include and are not limited to the probability distribution which we may think fits the data. So how do we determine what model is suitable and fits our data the best?
Let’s start by taking a quick look at fitting a polynomial to a few data points. At a first glance, this seems pretty trivial as you can control the fit of the polynomial by changing the order which in turn controls the number of free parameters in the model. Our goal here is to minimize the sum of squares error, and we see that the error decreases as the order increases. What’s happening here is that the higher order polynomials become tuned exactly to the training data points we supplied and exaggerate minor changes in the data. This is referred to as overfitting. Models with this problem will lead to very poor predictive performance on a new dataset. Note how the problem of overfitting is resolved by increasing the number of data points (Source code).
One of the most challenging aspects of model selection is determining the values of the free parameters to achieve the best predictive performance on new data. In our example above we can see that the performance of a model on a training set is a poor indicator of how the model will perform on new unseen data. There are several things we can do to better understand how a trained model will behave on unseen data.
If we have access to a large dataset then splitting the data into two sets, training and validation is the simplest solution. Multiple models with different parameter values are trained on the training set and then the model with the best predictive performance on the validation set is chosen. On the other hand, if data is scarce and we can’t afford to hold a validation set we can perform cross-validation. Where, the data available for training is also used to assess performance. The data is partitioned into N groups where N-1 groups are used to train the models that are then evaluated on the remaining group. This however, becomes computationally expensive as the number of groups and complexity of the models increase.
Neither of the above approaches are applicable when building a product for wide-scale application, the models and parameters must work with unknown and unseen data sets. Therefore we cannot determine the optimal model before hand and ship one model with the same parameters to all our users. Adding complexity, our anomaly detection runs in real-time, so distinct build-validate-tune cycles aren’t possible. We also would never ask the user to choose the model she thinks fits her data best and have her start configuring the different parameters. That would be a disaster, especially as the models increase in complexity.
We would never ship a myriad range of models with different parameters and hope one works, but we can leverage historical data. We choose the models that have shown to perform well in our lab environment and learn the parameters from the user’s historical data. This means each model is unique to the metric it is representing and it is unique to that metrics’ distribution. We allow multiple hyperparameters and model types to be compared in a single run at startup to assure that the “best” model is chosen.
Reference: Bishop, Pattern Recognition and Machine Learning, 2006
In the first post of this series, Anomaly Detection Part 1: Real-Time Anomaly Detection in Time Series Data, we provided a high level overview of how we’re approaching the problem of anomaly detection. And in the final post of this series, Anomaly Detection Part 3: Choose Your Assumptions Wisely, we discuss some of the assumptions made by anomaly detection techniques and their consequences in the context of probability distributions.