The concept of an anomaly is simple: it is an instance that deviates from an expected trend or from normal behavior. Understanding and characterizing normal behavior, on the other hand, is anything but simple. Several log management tools leverage the user’s knowledge of what constitutes normal behavior by asking them to write pre-defined rules or set thresholds on the metrics they wish to monitor. This approach becomes impractical for environments and infrastructures of extremely large scale and complexity. As complexity increases, the number of rules must increase too, and ultimately the inter-dependencies become too complicated to represent. Moreover, fixed thresholds can’t adapt to event volume that changes over time or that arrives in intermittent bursts.
At Rocana, we define event volume anomaly detection as the identification of data points that deviate from normal and expected behavior. For example, in one of our approaches, we use historical data to construct a quantitative representation of the data distribution exhibited by each metric being monitored. Each new data point is compared against this representation and assigned a score. Whether the new data point is an anomaly is then decided against a threshold that we derive from recent observations of the data. One of the key advantages of this approach is that the thresholds are not static, but rather evolve with the data.
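To make the score-and-threshold idea concrete, here is a minimal sketch in Python. It is not Rocana's implementation; every name and default in it is hypothetical. It assumes the "quantitative representation" is simply the median and median absolute deviation (MAD) of a sliding window of past event counts, and that the threshold is a high quantile of recently seen scores multiplied by a safety margin.

```python
from collections import deque
from statistics import median


class AdaptiveVolumeDetector:
    """Illustrative only: scores event counts against a sliding window of history."""

    def __init__(self, history_size=288, recent_size=48,
                 quantile=0.95, margin=1.5, warmup=10):
        # All defaults here are assumptions chosen for the example.
        self.history = deque(maxlen=history_size)       # past event counts
        self.recent_scores = deque(maxlen=recent_size)  # scores of recent points
        self.quantile = quantile
        self.margin = margin
        self.warmup = warmup

    def _score(self, value):
        # Distance from the historical median, in units of MAD.
        med = median(self.history)
        mad = median(abs(x - med) for x in self.history)
        return abs(value - med) / (mad if mad > 0 else 1.0)

    def _threshold(self):
        # The threshold is derived from recent scores, so it moves with the data.
        ordered = sorted(self.recent_scores)
        idx = min(len(ordered) - 1, int(self.quantile * len(ordered)))
        return self.margin * ordered[idx]

    def observe(self, value):
        """Record a new event count and return True if it looks anomalous."""
        if len(self.history) < self.warmup:
            self.history.append(value)
            return False
        score = self._score(value)
        ready = len(self.recent_scores) >= self.warmup
        is_anomaly = ready and score > self._threshold()
        self.recent_scores.append(score)
        self.history.append(value)
        return is_anomaly


detector = AdaptiveVolumeDetector()
for count in [100, 102, 98, 101, 99] * 20:
    detector.observe(count)        # steady traffic, nothing is flagged
print(detector.observe(500))       # a sudden burst is flagged
```

Because both windows slide, a sustained change in volume gradually becomes the new normal instead of raising alarms forever. A perfectly constant series would drive the threshold to zero, so a real system would likely need a floor on it.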
We are developing anomaly detection methods that continuously monitor all metrics rather than perform static profiling or rely on a limited non-evolving historical data set. Restricting an anomaly detection method to a training set that is not continuously evolving means building representative models that become obsolete with time and fail to account for periodicity. Therefore, our focus is on anomaly detection methods that:
Scale, and have low runtime complexity.
Take into account workload patterns including day of the week and hour of the day.
Require no configuration and work based on parameters derived from the data provided.
Produce few false positives and alarms, to gain the user's trust and limit frustration.
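As a rough illustration of the first two goals, the sketch below keeps one running mean and variance per (day of week, hour of day) bucket, using Welford's online update so that each new observation costs constant time and memory. The class and method names are hypothetical, and a per-bucket z-score is only one assumed way of accounting for workload patterns; it is not a description of our production method.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta


class RunningStats:
    """Welford's online mean and variance: O(1) time and memory per update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Population standard deviation of the values seen so far.
        return math.sqrt(self.m2 / self.n) if self.n else 0.0


class SeasonalBaseline:
    """Illustrative only: one baseline per (day of week, hour of day)."""

    def __init__(self):
        self.buckets = defaultdict(RunningStats)

    @staticmethod
    def _key(ts):
        return (ts.weekday(), ts.hour)

    def update(self, ts, count):
        self.buckets[self._key(ts)].add(count)

    def deviation(self, ts, count):
        """Standard deviations from the bucket mean, or None if too little data."""
        stats = self.buckets.get(self._key(ts))
        if stats is None or stats.n < 2:
            return None
        diff = count - stats.mean
        if stats.std == 0:
            return 0.0 if diff == 0 else math.copysign(math.inf, diff)
        return diff / stats.std


baseline = SeasonalBaseline()
monday_9am = datetime(2024, 1, 1, 9)   # a Monday
sunday_3am = datetime(2024, 1, 7, 3)   # a Sunday
for week, (busy, quiet) in enumerate([(100, 10), (110, 12), (90, 8), (100, 10)]):
    baseline.update(monday_9am + timedelta(weeks=week), busy)
    baseline.update(sunday_3am + timedelta(weeks=week), quiet)

# The same count of 100 is ordinary on Monday morning but extreme on Sunday night.
print(baseline.deviation(monday_9am + timedelta(weeks=4), 100))
print(baseline.deviation(sunday_3am + timedelta(weeks=4), 100))
```

The only inputs are timestamps and counts, so nothing has to be configured per metric. The trade-off is that each bucket sees just one sample per week, so such a baseline needs several weeks of data before its estimates are useful.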
Detecting anomalies in event volume is our first step towards enabling the user to be proactive, rather than reactive, when problems or outages occur. An event volume anomaly is the first indication that something unusual is happening in a metric: its current behavior deviates from what has happened in the past.
Our future work will give the user the ability to influence the learning algorithms through a feedback loop. The loop will let the user flag false positives and thereby improve the accuracy with which the method detects anomalies. Another feature that we’re really excited to start working on is expanding anomaly detection beyond event volume to include event motifs and patterns. This will entail analyzing all the data generated by a user’s environment in order to build a profile of how components normally behave and the order in which events typically occur. This is especially useful for monitoring application deployments and behavior in complex environments.
In the second post of this series, Anomaly Detection Part 2: The Model Selection Problem, we introduce the concept of model selection, which is the process of choosing, from a range of candidate models, the one that best fits the input data. And in the final post of this series, Anomaly Detection Part 3: Choose Your Assumptions Wisely, we discuss some of the assumptions made by anomaly detection techniques and their consequences in the context of probability distributions.