Mauro Krikorian
June 29, 2021
In this series I will be sharing the results of research and development work we conducted with SOUTHWORKS’ teams over 2020–2021.
Since a lot of people were involved through the different stages, and so as not to leave anyone unmentioned, I want to thank everybody who took part in the development projects, research spikes, documentation and everything else along the way.
You can refer to these two previous posts, which set the ground for the concepts and techniques I’m going to discuss in this article:
For this particular post, I want to give special thanks to the people that helped me to wrap it up: Pablo Costantini, Derek Fernandez & Ignacio Grayeb — thank you guys!
Also known as outlier detection, anomaly detection is a data mining process used to determine the types of anomalies found in a dataset and the details about their occurrences. In simpler words, anomaly detection deals with identifying unusual patterns or behaviors that don’t conform to the expected trend.
An outlier is nothing but a data point that differs significantly from other data points in the given dataset.
Outliers can come in different flavors, depending on the environment: point outliers, contextual outliers, or collective outliers. Point outliers are single data points that lay far from the rest of the distribution. Contextual outliers can be noise in data, such as punctuation symbols when performing text analysis or background noise when doing speech recognition. Collective outliers can be subsets of novelties in data, such as a signal that may indicate the discovery of new phenomena.
Anomaly detection is the process of finding the outliers in the data, i.e. points that are significantly different from the majority of the other data points.
Outliers can be of two kinds: univariate and multivariate. Univariate outliers can be found when looking at a distribution of values in a single feature space. Multivariate outliers can be found in an n-dimensional space (of n features). Looking at distributions in n-dimensional spaces can be very difficult for the human brain, thus we need to train a model to do it for us.
Large, real-world datasets may have very complicated patterns that are difficult to detect by just looking at the data. That’s why the study of anomaly detection is an extremely important application of Machine Learning.
Anomaly detection is based on 3 principles:
Anomaly detection has wide applications across industries. In this section, we’ll present some of the most popular use cases. However, keep in mind that this list is not exhaustive:
Banking. Anomaly detection can help find abnormally high deposits. Every account holder generally has certain patterns of depositing money into their account. If there is an outlier to this pattern, the bank needs to be able to detect and analyze it, e.g. for money laundering.
Finances. Finding patterns of fraudulent purchases. Every person generally has certain patterns of purchases they make. If there is an outlier to this pattern, the bank needs to detect it in order to analyze it for potential fraud.
Manufacturing. Abnormal machine behavior can be monitored for cost control. Many companies continuously monitor the input and output parameters of the machines they own. It is a well-known fact that before failure a machine shows abnormal behavior in terms of these input or output parameters. A machine needs to be constantly monitored for anomalous behavior from the perspective of preventive maintenance.
Networking. Detecting intrusion into networks. Any network exposed to the outside world faces this threat. Intrusions can be detected early on by monitoring the network for anomalous activity.
Healthcare. Detecting fraudulent insurance claims and payments.
Isolation Forest is an unsupervised learning algorithm, based on decision trees, that identifies anomalies by isolating outliers in the data. It does so by randomly selecting a feature from the given set of features and then randomly selecting a split value between the max and min values of that feature. This random partitioning of features will produce shorter paths in trees for the anomalous data points, thus distinguishing them from the rest of the data.
In general, the first step is to construct a profile of what’s “normal”, and then report anything that cannot be considered normal as anomalous. However, the isolation forest algorithm does not work on this principle; it does not first define “normal” behavior, and it does not calculate point-based distances.
As you might expect from the name, Isolation Forest instead works by explicitly isolating anomalous points in the dataset.
The Isolation Forest algorithm is based on the principle that anomalies are observations that are few and different, which should make them easier to identify. Isolation Forest uses an ensemble of Isolation Trees for the given data points to isolate anomalies.
Isolation Forest recursively generates partitions on the dataset by randomly selecting a feature and then randomly selecting a split value for that feature. Presumably, anomalies need fewer random partitions to be isolated than “normal” points in the dataset, so the anomalies will be the points with a smaller path length in the tree, where the path length is the number of edges traversed from the root node.
Using Isolation Forest, we can not only detect anomalies faster but also require less memory compared to other algorithms: since abnormal data points mostly have much shorter tree paths than normal data points, the trees in the isolation forest do not need to have a large depth.
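To make this concrete, here is a minimal scikit-learn sketch of the idea; the synthetic data, contamination value and hyper-parameters are assumptions for illustration, not taken from the scenarios later in this post:

```python
# Minimal Isolation Forest sketch (synthetic data and parameters are illustrative).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(500, 2)),   # mostly "normal" points
    rng.uniform(low=-6, high=6, size=(25, 2)),       # a few scattered outliers
])

model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
labels = model.fit_predict(X)        # 1 = inlier, -1 = outlier
scores = model.decision_function(X)  # lower score = shorter average path length = more anomalous

print("Points flagged as anomalies:", (labels == -1).sum())
```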
A support vector machine is another effective technique for detecting anomalies. An SVM is typically associated with supervised learning, but there are extensions such as One-Class SVM that can be used to identify anomalies as an unsupervised problem (in which training data is not labeled).
As we’ve just presented it, a One-Class Support Vector Machine is an unsupervised learning algorithm that is trained only on the normal data: it learns the boundary of these points and is therefore able to classify any point that lies outside that boundary as an outlier.
In a Support Vector Machine, data points are represented as points in space in such a way that points from different categories are separated by a plane. It’s possible to think of this like a line through data points that separates data of different classes.
New data points are mapped into the same space and their location relative to the plane is used to predict which category each point belongs to, with the plane being referred to as the decision boundary. In cases where the decision boundary needs to be non-linear (i.e. where classes cannot be separated by a straight line), SVMs can also project the data through a non-linear function (a kernel), lifting it to a space with a higher dimension where a linear decision boundary does separate the classes.
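A minimal sketch of the idea with scikit-learn’s OneClassSVM follows; the training data is assumed to contain only normal observations, and the nu and gamma values are illustrative:

```python
# One-Class SVM sketch: fit on normal data only, then flag points outside the learned boundary.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(size=(300, 2))                 # assumed to be normal data only
X_test = np.vstack([
    rng.normal(size=(50, 2)),                       # more normal points
    rng.uniform(low=-6, high=6, size=(10, 2)),      # plus a few outliers
])

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")  # rbf kernel gives a non-linear boundary
ocsvm.fit(X_train)

pred = ocsvm.predict(X_test)  # 1 = inside the boundary (normal), -1 = outlier
print("Outliers detected in the test set:", (pred == -1).sum())
```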
Long Short-Term Memory (LSTM) networks excel at finding complex relationships in multivariate time series data, making them the way to go when we want to detect anomalies in time series.
LSTMs are a sub-type of the more general Recurrent Neural Networks (RNNs). A key attribute of recurrent neural networks is their ability to persist information, or cell state, for later use. This makes them particularly well suited for the analysis of temporal data that evolves over time. LSTM networks are used in tasks such as speech recognition and text translation and, here, in the analysis of sequential data streams for anomaly detection.
One way is to use LSTMs to build a prediction model, i.e., given current and past values, predict the next few steps in the time series; the prediction error then gives an indication of anomaly. Another way is to use the LSTM directly as a classifier with two classes, normal and anomalous.
The prediction-model-based approach is better when anomalous instances are not easily available, whereas a classifier-based approach is more suitable when there are sufficient labelled instances of both the normal and anomalous classes.
A third way is to use LSTM autoencoders and encoder-decoder frameworks as reconstruction models, where some form of reconstruction error is used as a measure of anomaly. The idea behind such models is that the autoencoder is trained to reconstruct the normal time series, and it is assumed that it will do badly at reconstructing anomalous time series, having not seen them during training. In simpler words, the system looks at the previous values over hours or days and predicts the behavior for the next minute. If the actual value a minute later is within a standard deviation, then there is no problem; if it deviates more, it is an anomaly.
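As a rough sketch of the reconstruction approach (assuming Keras/TensorFlow, a univariate series already cut into fixed-length windows, and an illustrative one-standard-deviation threshold):

```python
# LSTM autoencoder sketch: train on normal windows, use reconstruction error as anomaly score.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

timesteps, n_features = 30, 1
X_train = np.random.normal(size=(1000, timesteps, n_features))  # placeholder "normal" windows

model = keras.Sequential([
    layers.LSTM(32, input_shape=(timesteps, n_features)),  # encoder: compress each window
    layers.RepeatVector(timesteps),                         # repeat the latent vector per timestep
    layers.LSTM(32, return_sequences=True),                 # decoder: rebuild the sequence
    layers.TimeDistributed(layers.Dense(n_features)),
])
model.compile(optimizer="adam", loss="mae")
model.fit(X_train, X_train, epochs=5, batch_size=64, verbose=0)

# Windows the model cannot rebuild well are the anomaly candidates.
reconstruction = model.predict(X_train, verbose=0)
errors = np.mean(np.abs(reconstruction - X_train), axis=(1, 2))
threshold = errors.mean() + errors.std()  # e.g. flag anything beyond one standard deviation
anomalous = errors > threshold
```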
These models have been used for anomaly detection in electrocardiograms (ECG), engines, power demand and networks (failures/intrusions), as well as for novelty detection in music.
The Local Outlier Factor (LOF) is a score that indicates how likely it is that a certain data point is an outlier (or anomaly). It detects anomalous points by computing the local density deviation of a given data point with respect to its neighbors.
There are two important parameters to consider: the number of neighbors and the contamination.
For each point, LOF calculates the distance to its k-th nearest neighbor, where k is the number of neighbors (if the number of neighbors is three, for example, it calculates the distance between the current point and the third-nearest point), and then computes the density of points inside that distance.
The density of each point will then be compared to the densities of their k neighbors. Considering a data point a, the LOF is basically the average ratio of the densities of the neighbors of a to the density of a. If the ratio is greater than 1, the density of point a is on average smaller than the density of its neighbors and, thus, from point a, we have to travel longer distances to get to the next point or cluster of points than from a’s neighbors to their next neighbors. Keep in mind that the neighbors of a point a may not consider a a neighbor as they have points in their reach which are a lot closer.
The contamination parameter determines the proportion of the most isolated points (points that have the highest LOF scores) to be predicted as anomalies.
The scikit-learn Python library has its own implementation of this algorithm, with two different modes determined by a parameter named novelty. These are: outlier detection (novelty=False, the default), where the model scores the same data it was fitted on via fit_predict; and novelty detection (novelty=True), where the model is fitted on clean data and then used to classify new, unseen samples via predict.
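Here is a small sketch showing both modes; n_neighbors, contamination and the synthetic data are illustrative choices:

```python
# Local Outlier Factor sketch covering both the outlier-detection and novelty-detection modes.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(7)
X = np.vstack([
    rng.normal(size=(300, 2)),                  # dense "normal" region
    rng.uniform(low=-8, high=8, size=(15, 2)),  # sparse outliers
])

# Outlier detection (novelty=False): score the same data the model is fitted on.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)                 # 1 = inlier, -1 = outlier
lof_scores = -lof.negative_outlier_factor_  # higher score = lower local density = more isolated

# Novelty detection (novelty=True): fit on clean data, then classify new samples.
novelty_lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X[:300])
new_points = rng.uniform(low=-8, high=8, size=(5, 2))
print(novelty_lof.predict(new_points))      # -1 marks points that look anomalous
```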
A Gaussian mixture model is a probabilistic model that assumes that the probability distributions for all points in a dataset can be obtained as a weighted sum of Gaussian (or normal) distributions with unknown parameters.
These models attempt to find different subpopulations within an overall population, without knowing which subpopulation a data point belongs to (thus being an unsupervised learning technique). The goal is to estimate the parameters for each normal distribution component in the dataset.
There are different techniques to implement Gaussian mixture models, the most common of which is expectation-maximization (EM). We will focus on this technique.
The EM method is most commonly used when the number of components k is known. Expectation-maximization is a statistical algorithm that involves an iterative process to determine which points came from which Gaussian component. This process consists of two steps: the expectation (E) step, which computes, for each point, the probability that it belongs to each component given the current parameters; and the maximization (M) step, which updates the components’ parameters (weights, means and covariances) to maximize the likelihood given those probabilities.
The algorithm repeats these two steps until it converges, resulting in a maximum likelihood estimate. The animation below shows a two-component bivariate example adjusting its parameters over multiple iterations until it converges.
An extension of this algorithm, variational inference, adds regularization by integrating information from prior distributions.
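A hedged sketch of how a Gaussian mixture can be used for anomaly detection with scikit-learn follows; the number of components and the score threshold are assumptions for the example:

```python
# Gaussian mixture sketch: points with low likelihood under the fitted mixture are anomalies.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(1)
X = np.vstack([
    rng.normal(loc=0, scale=1, size=(300, 2)),    # subpopulation 1
    rng.normal(loc=6, scale=1, size=(300, 2)),    # subpopulation 2
    rng.uniform(low=-10, high=15, size=(20, 2)),  # scattered anomalies
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=1)
gmm.fit(X)  # EM runs under the hood, alternating the E and M steps until convergence

# A low log-likelihood means the point fits none of the learned components well.
log_likelihood = gmm.score_samples(X)
threshold = np.percentile(log_likelihood, 3)  # e.g. flag the lowest 3% as anomalies
anomalies = log_likelihood < threshold
print("Anomalies flagged:", anomalies.sum())
```

For the variational extension mentioned above, scikit-learn provides the BayesianGaussianMixture class, which regularizes the fit through prior distributions.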
This scenario evaluates four different techniques on a credit card fraud detection dataset. The dataset is labelled, although the labels are only used to evaluate results, not to fit the models.
For this scenario the following four techniques were applied:
Implementation in Python is available in a Jupyter notebook in our GitHub repository.
The chosen dataset, which is available on Kaggle, contains raw data corresponding to credit card transactions. It has 3075 entries with 11 features and an additional column that indicates whether each transaction was fraudulent or not. For model fitting, this last column was dropped and only used to validate results. Around 15% of the entries are labelled as frauds.
An initial cleansing of the dataset was performed to have a basic version to work with. At this stage, columns that lacked data or didn’t provide any useful information were discarded, and some minor tweaks were applied, like turning all string values into numbers. The resulting dataset had 9 features out of the original 11. With this minimally preprocessed dataset, each technique was executed in order to choose the one that performed best.
Later on, more preprocessing was done on the dataset, looking for a positive impact on the best technique’s performance. The number of features was reduced to 6, and additional transformations were applied after analyzing the data by looking at the distribution of each feature and the correlations between them.
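As an illustration only of that kind of cleanup (the actual column names and transformations live in the notebook linked above, so everything below is a placeholder):

```python
# Hypothetical cleanup sketch; file name and column names are placeholders.
import pandas as pd

df = pd.read_csv("credit_card_transactions.csv")  # placeholder path

# Drop columns that lack data or add no useful information.
df = df.dropna(axis=1, how="all")

# Turn string values into numbers by encoding categorical columns.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes

# Keep the fraud label aside: it is only used to validate results, never to fit the models.
y = df.pop("is_fraud")  # placeholder label column name
X = df
```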
Each technique’s algorithm has a sensitivity parameter that affects the number of cases it considers to be outliers. In a real-world scenario, cases marked as outliers (frauds) by an algorithm would then be audited by subject matter experts to make the final decision on whether they are actual frauds or not.
Once you can ‘trust’ an algorithm’s accuracy, the goal would be to catch all fraud cases by auditing as few transactions as possible. In this dataset, about 15% of the transactions are fraudulent, so the ‘ideal algorithm’ would mark for further analysis only that subset, which would contain 100% of the frauds within the dataset.
If we compare the algorithms’ performance from that standpoint, and evaluate the efficiency of each by its ability to submit the smallest and most accurate subset for an auditor to validate, we can observe that some algorithms, like the Autoencoder or the Gaussian Mixture, do not do well at all; as a matter of fact, the latter starts to lose accuracy as the subset for validation gets bigger and bigger.
The curves in the plots below show the ratio of marked frauds as a function of the ratio of cases required to be audited for each algorithm.
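For reference, such a curve can be computed along these lines, assuming scores holds a model’s anomaly scores (higher meaning more anomalous) and y the ground-truth fraud labels; the helper name is just for this sketch:

```python
# Sketch: fraction of frauds caught as a function of the fraction of cases audited.
import numpy as np

def fraud_capture_curve(scores, y):
    order = np.argsort(scores)[::-1]            # audit the most anomalous cases first
    frauds_sorted = np.asarray(y)[order]
    caught = np.cumsum(frauds_sorted) / frauds_sorted.sum()
    audited = np.arange(1, len(frauds_sorted) + 1) / len(frauds_sorted)
    return audited, caught

# Plotting `caught` against `audited` for each model reproduces one of the curves below.
```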
A simpler, different scenario was designed using other techniques mentioned throughout this post. It is a simple spike for detecting abnormal values within a dataset consisting of normally distributed 2D data points mixed with a portion of uniformly distributed outliers.
Two different techniques, along with their Python implementations, are described: Isolation Forest and One-class Support Vector Machine.
The spike can be found within our repository in a Jupyter notebook.
These are two small, independent datasets consisting of 200 normally distributed data points, which represent the normal values, and 50 uniformly distributed data points with higher highs and lower lows compared to the normal data points.
Data-points are distributed in the following way:
Base clusters (normally distributed)
Outliers (uniformly distributed)
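A sketch of synthetic data along those lines is shown below; the exact cluster centers, spreads and outlier range are assumptions, since the real values live in the notebook:

```python
# Synthetic 2D data sketch: two normal clusters plus uniformly distributed outliers.
import numpy as np

rng = np.random.RandomState(42)

def make_dataset():
    # Base clusters (normally distributed): 200 points split across two blobs.
    normal = np.vstack([
        0.3 * rng.randn(100, 2) + 2,
        0.3 * rng.randn(100, 2) - 2,
    ])
    # Outliers (uniformly distributed): 50 points with higher highs and lower lows.
    outliers = rng.uniform(low=-6, high=6, size=(50, 2))
    X = np.vstack([normal, outliers])
    y = np.hstack([np.ones(200), -np.ones(50)])  # 1 = normal, -1 = outlier
    return X, y

X_train, y_train = make_dataset()  # training set
X_test, y_test = make_dataset()    # independent test set
```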
After fitting both Isolation Forest and One-class Support Vector Machine on the training set, predictions were run over the test set in order to check how both algorithms extrapolate their knowledge.
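Continuing the data sketch above, the fitting and prediction step could look roughly like this (hyper-parameters are illustrative, not the tuned values from the notebook):

```python
# Fit both models on the training set and evaluate their predictions on the test set.
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.metrics import classification_report

contamination = 50 / 250  # known proportion of outliers in the synthetic data

iso = IsolationForest(contamination=contamination, random_state=42).fit(X_train)
svm = OneClassSVM(kernel="rbf", nu=contamination, gamma="auto").fit(X_train)

for name, model in [("Isolation Forest", iso), ("One-class SVM", svm)]:
    pred = model.predict(X_test)  # both return 1 for inliers, -1 for outliers
    print(name)
    print(classification_report(y_test, pred))
```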
Isolation Forest
Metrics report that for the spike, this algorithm achieved:
One-class Support Vector Machine
Metrics report that for the spike, this algorithm achieved:
As you have seen throughout the article, there are many techniques within the unsupervised learning paradigm and, among them, some are particularly well aligned with detecting anomalies in a dataset.
Since detecting fraud is, at the end of the day, finding anomalies in the dataset, these techniques are a perfect fit for this scenario. Depending on the attributes your dataset contains and the pre-processing techniques applied to curate the input and get it ready for analysis, some of them are going to perform better than others, as you have seen in the example showcased above.
It is always a good idea to try many different pre-processing approaches and algorithms, which you will need to fine-tune, in order to understand which one delivers the best results. Although for some scenarios, like the ones showcased above, there are algorithms that are known to perform better than others, a big part of the job is the work your data scientists do on the initial raw dataset to get it ready for the training and validation jobs ahead.
Originally published by Mauro Krikorian for SOUTHWORKS on Medium 29 June 2021