An outlier in data is called an anomaly. It can be a simple spike in data, a collection of multiple spikes, or even the absence of an expected spike in a pattern. Anomalies are symptoms of expensive business problems ranging from faulty machinery to high-level fraud. In this post we will discuss the definition, approaches, classification, and evaluation of Anomaly Detection.
To give you some perspective on how important Anomaly Detection can be, the US Government Accountability Office last year estimated that $141 billion in ‘improper payments’ occurred in 2017 across several federal benefit programs. Fraud, waste, and abuse are generally all classified as improper payments. Furthermore, this number has risen steadily, from $38 billion in 2005 to $141 billion in 2017.
A statistical system to identify any unusual “improper payment” incidents would allow for their further review prior to payment. These anomalies would raise suspicions and ultimately lead to fewer “improper payments.” As you can imagine, there are many similar opportunities in the private sector.
What is Anomaly Detection?
Identifying rare events or data points which are significantly different from the majority is called Anomaly Detection. Advanced methods identify regular patterns in data and alert the system when an unanticipated pattern is detected.
Your phone flagging an incoming call as spam and the text alert you receive when somebody tries to use your stolen credit card are both real-world examples of anomaly detection mechanisms from everyday life. These algorithms are used extensively across industries such as banking and finance, insurance, intrusion detection in computer networking, and military and satellite systems. The IoT space is another emerging area where Anomaly Detection is being widely adopted.
Significance of Anomaly Detection in Business Processes
In today’s world, data is the core asset for most businesses. Traditional diagnostic systems are equipped with dashboards, alerts and KPIs. The limitation here is that the data sets are so large that the human effort needed to identify and classify the anomalies would be prohibitive. It is not practical for humans to address these issues at scale for mission-critical applications. Failing to detect an anomaly could cost money or, more critically, in some situations, lives. This demands an automated system.
For example, a price glitch or broken machinery could cause direct money leakage for a modern volume-based business. Credit card fraud, if not caught in a timely manner, can cause significant financial loss to the parties involved.
Types of Anomalies
Anomalies are generally classified as:
Point Anomaly: A single data point that is an outlier with respect to the rest of the data. E.g.: A credit card transaction for a very large amount.
Contextual Anomaly: A data point that is unusual only in a particular context, such as geographic region or time, so related transactions must be considered together to make sense of it. E.g.: A gas purchase in New York and a Starbucks transaction in Texas, both within an hour.
Collective Anomaly: A collection of data points that stands out from the rest of the larger population of data. E.g.: Multiple transactions in the middle of the night which are found to be unusual.
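To make the contextual example concrete, here is a minimal sketch of a check that flags two consecutive transactions on the same card made in different states within an hour. The transaction tuples, the `find_location_conflicts` helper, and the one-hour window are all assumptions made for illustration, not part of any particular fraud system.

```python
from datetime import datetime, timedelta

# Hypothetical transactions for one card: (timestamp, state, merchant).
transactions = [
    (datetime(2019, 3, 1, 9, 0), "NY", "gas station"),
    (datetime(2019, 3, 1, 9, 40), "TX", "coffee shop"),
    (datetime(2019, 3, 2, 18, 15), "TX", "grocery store"),
]


def find_location_conflicts(txns, window=timedelta(hours=1)):
    """Return pairs of consecutive transactions made in different
    states within `window` of each other."""
    ordered = sorted(txns, key=lambda t: t[0])
    conflicts = []
    for prev, curr in zip(ordered, ordered[1:]):
        close_in_time = curr[0] - prev[0] <= window
        different_place = curr[1] != prev[1]
        if close_in_time and different_place:
            conflicts.append((prev, curr))
    return conflicts


conflicts = find_location_conflicts(transactions)
for prev, curr in conflicts:
    print(f"Suspicious: {prev[2]} in {prev[1]} and {curr[2]} in {curr[1]}")
```

Neither transaction is unusual on its own; only the combination of time and place makes the pair stand out.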
Role of Trend, Seasonality and Noise in Anomaly Analysis
Trends, seasonality, noise, and other patterns need to be handled carefully while detecting anomalies. If not properly addressed, these factors can skew the results of any anomaly detection method. Trends and seasonality can be identified by several methods, such as Decomposition Analysis and Periodogram Analysis. Noise can be tricky, however, as there is no general rule for a noise threshold. The treatment of noise will vary mostly based on the scenario being handled and the tolerance for false positives and false negatives in the result.
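As a rough illustration of why trend and seasonality matter, the sketch below removes both from a synthetic daily series and only then looks for outliers in what remains. The data, the weekly period, the moving-average trend estimate, and the 3-standard-deviation cutoff are all assumptions chosen for this example; a real decomposition would normally come from a statistics library.

```python
from statistics import mean, pstdev

PERIOD = 7  # assumed weekly cycle on daily data (odd, so the window centers cleanly)

# Synthetic series: linear trend + weekly pattern, with one injected spike.
weekly_pattern = [0, 1, 2, 3, 2, 1, 0]
series = [0.5 * t + weekly_pattern[t % PERIOD] for t in range(56)]
series[30] += 15  # the anomaly we hope to recover

half = PERIOD // 2

# 1. Trend: centered moving average over one full period.
trend = {
    t: mean(series[t - half : t + half + 1])
    for t in range(half, len(series) - half)
}

# 2. Seasonality: average detrended value at each position in the cycle.
detrended = {t: series[t] - trend[t] for t in trend}
seasonal = [
    mean(v for t, v in detrended.items() if t % PERIOD == phase)
    for phase in range(PERIOD)
]

# 3. Remainder: what is left after removing trend and seasonality.
remainder = {t: detrended[t] - seasonal[t % PERIOD] for t in detrended}

# 4. Flag remainders more than 3 standard deviations from their mean.
mu, sigma = mean(remainder.values()), pstdev(remainder.values())
anomalies = [t for t, r in remainder.items() if abs(r - mu) > 3 * sigma]
print(anomalies)
```

Applying the same threshold to the raw series would mostly measure the trend itself, which is why the remainder is the better place to look.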
Anomaly Detection Approaches
- Rule-based approach: This is a traditional and straightforward approach. Rules are set based on existing knowledge. This can be implemented using any programming language, expert system or event processing system.
- Simple statistics: For normally distributed data, standard deviation can detect outliers. The Z-score, or standard score, is the number of standard deviations an element is from the mean; elements beyond a chosen threshold (commonly 3) are flagged.
- Interquartile Range (IQR) is another method. It measures the spread of the middle 50% of the data, and points far outside that range are treated as outliers.
- Hypothesis tests are procedures for determining if a proposition can be rejected based on sample data. Several tests are used depending on the scenario, including Grubbs’ test, the Chi-Square test, and Dixon’s Q test.
- Machine Learning: Following are some of the approaches used to detect anomalies using Machine Learning.
- Principal Component Analysis (PCA) is mostly used when anomalies are rare. This method analyzes the features to determine what normal data looks like. PCA identifies correlations among the variables and derives the combinations of variables (principal components) that best capture the variation in the data. As a result, standard statistics can then be used to identify anomalies. Each new input is projected onto the eigenvectors, and the normalized reconstruction error is used as the anomaly score.
- Multivariate Gaussian anomaly detection is a quantitative approach that estimates a probability distribution from the data points and flags points with very low probability. If the sample size is small, it is also common to visualize the data and identify outliers through visual exploration.
- Univariate Time Series anomaly detection is a very popular anomaly detection method. The Holt-Winters method or ARIMA modeling is a good choice for cyclic data. Time series with seasonal and trend components can be modeled with Decomposition Analysis, and any outlier detection technique can then be applied to the remainder.
- Supervised Training models are used if you have samples available with anomaly cases. The challenge here is data imbalance. Several techniques are used for resampling and balancing the data set to get more accurate results.
- Unsupervised or semi-supervised models like K-Means clustering, K-Nearest Neighbor, and Hidden Markov Models (HMM) are used for a wide range of applications.
- Graph analysis is used for scenarios with interconnection of multiple players.
- Augmented Anomaly Detection is a crowdsourced hybrid approach which uses a feedback loop to identify anomalous cases and uses supervised training models for outlier prediction.
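The simple statistical checks from the list above (Z-score and IQR) can be sketched in a few lines of standard-library Python. The transaction amounts, the function names, and the thresholds (3 standard deviations, 1.5 times the IQR) are assumptions for illustration; the thresholds are common conventions rather than fixed rules.

```python
from statistics import mean, quantiles, stdev

# Hypothetical transaction amounts; 950 is the planted outlier.
amounts = [52, 48, 50, 47, 53, 51, 49, 50, 46, 54, 50, 52, 48, 51, 49, 950]


def zscore_outliers(values, threshold=3.0):
    """Values whose distance from the mean exceeds `threshold`
    sample standard deviations."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]


def iqr_outliers(values, k=1.5):
    """Values more than `k` IQRs below the first quartile
    or above the third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


print("Z-score:", zscore_outliers(amounts))
print("IQR:    ", iqr_outliers(amounts))
```

One design point worth noting: the outlier itself inflates the mean and standard deviation used by the Z-score, so on small samples the quartile-based check tends to be the more robust of the two.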
Classifying detected anomalies is the least addressed problem in most Anomaly Detection implementations. If an item is identified as an anomaly, it could stem from several possible causes. A large corporation will have separate departments handling each of these causes, so putting everything into a single bucket will create a bottleneck in handling them. Therefore, bucketing anomalies granularly is important so that the right people can work on addressing the issues.
There are several ways to do this. It could be treated as a prediction problem, or rule-based bucketing could be used to classify. Another way of doing this is to crowdsource the training of an algorithm, which will improve the system over time. These types of features make the solution prescriptive and help solve multiple related problems.
The tradeoff between false positives and false negatives is the key to tuning any anomaly detection system. This will mostly depend on the business context and use case.
When labeled data are available, Precision-Recall curves can be used to evaluate the performance of unsupervised anomaly detection algorithms. There is another approach called comparative evaluation, where multiple algorithms are ranked and evaluated using a single receiver operating characteristic (ROC) curve.
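To show how the tradeoff plays out, the sketch below computes precision and recall at a few thresholds for a small set of labeled anomaly scores. The labels, scores, and thresholds are made up for illustration, and `precision_recall` is a hypothetical helper; sweeping the threshold over all score values would trace out a Precision-Recall curve.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when every score >= threshold is flagged.

    Returns 0.0 for a metric whose denominator is zero (a convention
    chosen for this sketch).
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum(not p and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth (True = real anomaly) and detector scores.
labels = [False, False, True, False, True, False, False, True, False, False]
scores = [0.10, 0.35, 0.90, 0.20, 0.60, 0.70, 0.05, 0.40, 0.15, 0.30]

for threshold in (0.3, 0.5, 0.8):
    p, r = precision_recall(labels, scores, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

A low threshold catches every anomaly at the cost of more false alarms, while a high threshold does the opposite; which end is preferable depends on the business context.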
We have tried to give an overview of the popular techniques used in anomaly detection without going into a deep discussion. Data analysis, anomaly detection, and labeling of results combine to form the end-to-end approach for anomaly detection. Proper implementation of anomaly detection will bring faster ROI for businesses. Having such automated systems in place is a proactive guardrail for business processes.
Bineesh heads Expeed Software’s India operations and has been with Expeed since 2015. His areas of expertise include database programming and modeling, data architecture, BI and data warehousing, SQL Server BI tools, and data science and machine learning (SSAS, Azure ML, Python & R). He is a Microsoft Certified IT Professional and has completed a Coursera program on Data Science by Johns Hopkins University.