
Dynamic analysis on outliers




Treating outliers

Introduction

Outliers are the extreme values of a variable. Depending on the model or the requirement, it may be necessary to treat them, either by transforming them or by deleting them.
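To make the two options concrete, here is a minimal Python sketch. It is not the post's R code, and the income values and the $100,000 cut-off are made up for illustration. A log transform is used as one possible transformation; other choices are equally valid.

```python
import math

# Hypothetical incomes in $ (made-up values, not the post's data).
incomes = [12_000, 18_000, 25_000, 31_000, 40_000, 52_000, 500_000]

# Option 1: transform. A log transform compresses the right tail
# while keeping every case.
log_incomes = [math.log10(x) for x in incomes]

# Option 2: delete. Drop every case above a chosen cut-off.
cutoff = 100_000  # arbitrary illustrative threshold
kept = [x for x in incomes if x <= cutoff]

print(len(log_incomes), len(kept))
```

Transforming keeps all seven cases but shrinks the gap between the extreme value and the rest; deleting leaves six cases and discards the extreme one entirely.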

Variable “Income” distribution

01_income_distribution

This will be our main variable in this example; it represents customers' income in $. There are a few cases with very high values and, on the other hand, many cases with low or mid values.

If we choose to delete them…

A common question is: “How many cases do we have to leave out?” We can choose to leave out the highest 1%, and we obtain:

02_income_p99.JPG

The distribution now looks very similar to the previous one, except that it reaches $300,000 instead of $500,000.
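Leaving out the highest 1% can be sketched in a few lines of Python. This is an illustration rather than the post's R code: the helper name `drop_top_fraction` is hypothetical, and the data is a synthetic lognormal sample chosen only because it is right-skewed like the income variable.

```python
import random

def drop_top_fraction(values, fraction=0.01):
    """Return the values sorted ascending with the highest `fraction` removed."""
    ordered = sorted(values)
    n_drop = int(len(ordered) * fraction)  # number of cases to remove
    return ordered[:len(ordered) - n_drop]

random.seed(42)
# Synthetic right-skewed "income" sample; the lognormal shape and its
# parameters are assumptions, not the post's data.
income = [random.lognormvariate(10.5, 0.8) for _ in range(10_000)]

trimmed = drop_top_fraction(income, 0.01)
print(len(income), len(trimmed))
print(round(max(income)), round(max(trimmed)))
```

The maximum drops sharply after trimming, while the number of cases barely changes, which is why the overall shape looks so similar.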

If we repeat this process iteratively (deleting the highest 1%, then deleting the highest 1% of that result, and so on, 10 times), we are analyzing different cut-off values for leaving out extreme values. Since each round keeps 99% of the remaining cases, about 90% of the original cases are left after 10 rounds. The result is curious: the silhouette always remains similar to:

03_density_simple.JPG
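The iterative process can be sketched as a simple loop. As before, this is a Python illustration under assumptions (synthetic lognormal data, a hypothetical `drop_top_fraction` helper), not the post's R code. It records the mean and the median after each round: a mean above the median is a rough sign that the distribution is still right-skewed.

```python
import random
import statistics

def drop_top_fraction(values, fraction=0.01):
    """Return the values sorted ascending with the highest `fraction` removed."""
    ordered = sorted(values)
    n_drop = int(len(ordered) * fraction)
    return ordered[:len(ordered) - n_drop]

random.seed(42)
# Synthetic right-skewed sample (assumed shape, not the post's data).
income = [random.lognormvariate(10.5, 0.8) for _ in range(10_000)]

history = []  # one tuple per round: (round, cases left, max, mean, median)
current = income
for i in range(1, 11):
    current = drop_top_fraction(current, 0.01)
    history.append((i, len(current), max(current),
                    statistics.mean(current), statistics.median(current)))

for i, n, top, mean, median in history:
    print(f"round {i:2d}: n={n}, max={top:,.0f}, "
          f"mean={mean:,.0f}, median={median:,.0f}")
```

On this synthetic sample the maximum falls every round, yet the mean stays above the median throughout, which matches the observation that the silhouette keeps its shape.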



Animating the example

The following animation shows this iterative deleting process in action. As we leave out the highest 1%, the silhouette keeps a similar shape:

04_fractal_outliers.gif

In other words, there are always many people with low or mid income and only a few cases with high income, because of the nature of the distribution. The axis values change with each iteration.

If we replace the histogram with a density plot, the result looks even more like zooming in on the left side of the data:

05_fractal_outliers_density.gif

When we delete the lowest or highest values of any variable, what we are doing is “zooming in” on the area where most cases are.



Final thoughts

In this particular case, we could choose to leave out the highest 0.5% or 1% of the data. However, deleting all outliers is not always recommended: sometimes they carry valuable information, such as fraud, a machine failure, or any other event that deserves further inspection.
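When outliers may deserve inspection, one alternative to deleting them is to flag them and keep every case. The following Python sketch shows the idea under assumptions: `flag_top_fraction` is a hypothetical helper, and the values are a made-up sequence rather than real income data.

```python
def flag_top_fraction(values, fraction=0.01):
    """Return (value, is_outlier) pairs in the original order; nothing is deleted."""
    ordered = sorted(values)
    n_flag = int(len(ordered) * fraction)
    if n_flag == 0:
        return [(v, False) for v in values]
    # Smallest value among the top fraction; ties with it are flagged too.
    threshold = ordered[-n_flag]
    return [(v, v >= threshold) for v in values]

# Made-up values: 1, 2, ..., 200 (illustrative only).
values = list(range(1, 201))
flagged = flag_top_fraction(values, 0.01)

to_inspect = [v for v, is_outlier in flagged if is_outlier]
print(len(flagged), to_inspect)
```

The flagged cases can then be reviewed separately, for example as fraud candidates, while the rest of the analysis uses either the full or the filtered data.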


Contact

  • R code and data available on GitHub


Pabloc

Data Analysis ~ The art of finding order in data by browsing its inner information.