# Anomaly Detection in R

### Introduction

Inspired by this Netflix post, I decided to write a post based on this topic using R.

There are several nice packages for this; the one we're going to review is `AnomalyDetection`.

### Normal Vs. Abnormal

An abnormal element, or outlier, is one that does not follow the behaviour of the majority.

Data has noise, much like a radio with a poor signal: you end up hearing some background noise.

• The orange section could be noise in the data, since it oscillates around a value without showing a defined pattern. In other words: white noise.
• Are the red circles noise, or are they peaks from a hidden pattern?

A good algorithm can detect abnormal points while accounting for the inherent noise and leaving it aside. The `AnomalyDetectionTs` function in the `AnomalyDetection` package performs this task quite well.

### Hands on anomaly detection!

In this example, the data comes from the well-known Wikipedia, which offers an API to download from R the `daily page views` for any `{term + language}`.

In this case, we've got the page views for the term `fifa`, language `en`, from `2013-02-22` up to today.

After applying the algorithm, we can plot the original time series plus the abnormal points in which the page views were over the expected value.

The parameters passed to the algorithm are `max_anoms=0.01` (at most `1%` of the data points can be flagged as outliers in the final result) and `direction="pos"` (to detect anomalies above, not below, the expected value).
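As a rough sketch of what such a call can look like, the snippet below runs `AnomalyDetectionTs` with `max_anoms = 0.01` and `direction = "pos"`. The data frame `views` is synthetic and only stands in for the Wikipedia series; its column names, the injected spikes and the `e_value = TRUE` setting (used here to ask for the expected value) are assumptions for illustration, not the original code of this post. The call is skipped if the package is not installed.

```r
set.seed(1)

# Synthetic stand-in for the daily page views (NOT the real Wikipedia data):
# a weekly cycle plus noise, with two artificial spikes.
n <- 200
views <- data.frame(
  timestamp = as.POSIXct(seq(as.Date("2013-02-22"), by = "day", length.out = n), tz = "UTC"),
  count = round(30000 + 2000 * sin(2 * pi * seq_len(n) / 7) + rnorm(n, sd = 500))
)
views$count[c(60, 150)] <- views$count[c(60, 150)] + 20000

res <- NULL
if (requireNamespace("AnomalyDetection", quietly = TRUE)) {
  res <- AnomalyDetection::AnomalyDetectionTs(
    views,
    max_anoms = 0.01,   # upper bound on the share of points flagged
    direction = "pos",  # only anomalies above the expected value
    e_value = TRUE,     # also return the expected value for each anomaly
    plot = FALSE
  )
  print(res$anoms)
}
```

The input is a two-column data frame (timestamp, value), and the detected points come back in `res$anoms`.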

As a result, 8 anomalous dates were detected. Additionally, the algorithm returns what the expected value would have been, and an extra calculation is performed to express the difference as a percentage, `perc_diff`.
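The exact formula behind `perc_diff` is not shown here, so the following is only one plausible way to compute it: the gap between the expected and the observed value, relative to the expected value. The numbers are made up for illustration, and the column names `anoms` and `expected_value` are assumed to match the output of the algorithm.

```r
# Made-up rows, for illustration only.
anoms <- data.frame(
  anoms = c(40000, 50000),          # observed page views on the flagged dates
  expected_value = c(40000, 25000)  # what the algorithm would have expected
)

# Assumed formula: relative gap between expected and observed, in percent.
anoms$perc_diff <- round(
  100 * (anoms$expected_value - anoms$anoms) / anoms$expected_value
)
anoms
```

With this definition, a point above its expected value gets a negative `perc_diff`, and a point equal to its expected value gets 0.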

If you want to know more about the maths behind it, search for `Generalized ESD` and time series decomposition.
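To give a feeling for how these two ideas fit together, here is a minimal base R sketch: first decompose the series with `stl` to remove the seasonal and trend parts, then run a one-sided generalized ESD test on the remainder. It uses the classic mean and standard deviation version of the test on synthetic data; as far as I know, the package relies on a more robust seasonal hybrid variant (median and MAD), so treat this as an approximation of the idea, not as the package's implementation. The helper `gesd_pos` is hypothetical.

```r
set.seed(42)

# Synthetic daily series: weekly pattern + white noise + three injected spikes.
n <- 140
weekly <- rep(c(0, 1, 2, 3, 2, 1, 0), length.out = n)
y <- 100 + 5 * weekly + rnorm(n, sd = 2)
spikes <- c(30, 75, 110)
y[spikes] <- y[spikes] + 40

# Step 1: time series decomposition; keep what seasonality and trend cannot explain.
fit <- stl(ts(y, frequency = 7), s.window = "periodic", robust = TRUE)
remainder <- as.numeric(fit$time.series[, "remainder"])

# Step 2: one-sided (positive) generalized ESD test on the remainder.
gesd_pos <- function(x, max_outliers, alpha = 0.05) {
  n <- length(x)
  idx <- seq_len(n)      # positions still under consideration
  removed <- integer(0)  # candidates, in order of removal
  n_out <- 0
  for (i in seq_len(max_outliers)) {
    dev <- x[idx] - mean(x[idx])
    j <- which.max(dev)
    R <- dev[j] / sd(x[idx])                    # test statistic
    p <- 1 - alpha / (n - i + 1)
    t_crit <- qt(p, df = n - i - 1)
    lambda <- (n - i) * t_crit /
      sqrt((n - i - 1 + t_crit^2) * (n - i + 1))  # critical value
    removed <- c(removed, idx[j])
    idx <- idx[-j]
    if (R > lambda) n_out <- i                  # largest i that passes
  }
  sort(removed[seq_len(n_out)])
}

detected <- gesd_pos(remainder, max_outliers = 7)
detected
```

The number of outliers is the largest `i` whose statistic exceeds its critical value, which is what lets the test find several anomalies even when they partly mask each other.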

Something went wrong:
Strangely, the first expected value is identical to the observed value of the series (`34028` page views). As a result, `perc_diff` is 0, when it should be a really low number. However, this anomaly is still detected correctly, and apparently so are the following ones. If you know why, you can email and share the knowledge :)

### Discovering anomalies

The last plot shows a line indicating the linear trend over a specific period (clearly decreasing) and two black circles. It's interesting to note that these black points were not detected by the algorithm because they are part of a decreasing tendency (noise perhaps?).

A really nice shot by this algorithm, since the detection focuses on changes in the general pattern. Just take a look at the last detected point in that period: it was a peak that didn't follow the decreasing pattern (it occurred on `2014-07-12`).
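A trend line like the one in that plot can be obtained with a plain linear regression restricted to the period of interest. The sketch below uses synthetic data and arbitrary window dates (both are assumptions for illustration), and simply checks that the fitted slope is negative.

```r
set.seed(7)

# Synthetic decreasing series (NOT the real page views).
days <- seq(as.Date("2014-06-12"), by = "day", length.out = 60)
day_index <- seq_along(days)
views <- 60000 - 500 * day_index + rnorm(length(days), sd = 1500)

# Fit a straight line only over a chosen period (dates picked arbitrarily).
in_window <- days >= as.Date("2014-06-20") & days <= as.Date("2014-07-31")
trend_fit <- lm(views ~ day_index, subset = in_window)

# A negative slope means page views fall, on average, day after day.
slope <- unname(coef(trend_fit)["day_index"])
slope
```

Points that stay close to this line are consistent with the decreasing pattern; a peak far above it is the kind of point that breaks the pattern.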

### Checking with the news

These anomalies for the term `fifa` are correlated with the news. The first group of anomalies is related to the FIFA World Cup (around Jun/Jul 2014), and the second group, centered on May 2015, is related to the FIFA scandal.

The LA Times has a timeline of the scandal with two important dates, May 27th and 28th, both of which were found by the algorithm.

### Next step

There is a complete chapter in the Data Science Live Book covering the treatment of outliers, which can be seen, in a way, as a kind of anomalous data. All the examples are in R, and the topic is covered from both the practical and the theoretical perspective.

### Data Science Live Book (open source)

📌 Continue learning about machine learning and data science with the Data Science Live Book (https://livebook.datascienceheroes.com). Fully available online!