
Automatic data types checking in predictive models

The problem: we have data and need to build models (xgboost, random forest, regression, etc.), and each of them has its own constraints regarding data types.
Many strange errors appear when creating models just because of the data format.

The new version of funModeling, 1.9.3 (Oct 2019), aims to provide quick and clean assistance with this.

Cover photo by: @franjacquier_

tl;dr;code πŸ’»

We have some messy data and want to run a random forest, so before running into some weird errors we can check...

Example 1:

# install.packages("funModeling")
library(funModeling)
library(tidyverse)

# Load data
data=read_delim("https://raw.githubusercontent.com/pablo14/data-integrity/master/messy_data.txt", delim = ';')

# Call the function:
integ_mod_1=data_integrity_model(data = data, model_name = "randomForest")

# Any errors?
integ_mod_1
## 
## βœ– {NA detected} num_vessels_flour, thal, gender
## βœ– {Character detected} gender, has_heart_disease
## βœ– {One unique value} constant

Leaving aside the "one unique value" warning, the other errors need to be solved in order to create a random forest.

Oh wow! We need to change that!

Algorithms have their own data type restrictions and their own error messages, which can turn execution into a hard debugging task... data_integrity_model alerts about such issues with a common, consistent message.
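
A minimal sketch of one way to fix the issues reported above, assuming we drop the rows with NA and the constant column (imputing the NA values would be another valid option):

# Convert character variables to factors, drop the single-value column,
# and remove the rows containing NA (one possible cleaning strategy among many)
data_clean = data %>%
  mutate(across(where(is.character), as.factor)) %>%
  select(-constant) %>%
  drop_na()

# Re-run the check; it should now pass for randomForest
data_integrity_model(data = data_clean, model_name = "randomForest")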

Introduction

data_integrity_model is built on top of the data_integrity function, which we covered in the post: Fast data exploration for predictive modeling.

It checks:

  • NA
  • Data types (allow non-numeric? allow character?)
  • High cardinality
  • One unique value

Supported models πŸ€–

It takes the metadata from a table that is pre-loaded with funModeling:

head(metadata_models)
## # A tibble: 6 x 6
##   name         allow_NA max_unique allow_factor allow_character only_numeric
##   <chr>        <lgl>         <dbl> <lgl>        <lgl>           <lgl>       
## 1 randomForest FALSE            53 TRUE         FALSE           FALSE       
## 2 xgboost      TRUE            Inf FALSE        FALSE           TRUE        
## 3 num_no_na    FALSE           Inf FALSE        FALSE           TRUE        
## 4 no_na        FALSE           Inf TRUE         TRUE            TRUE        
## 5 kmeans       FALSE           Inf TRUE         TRUE            TRUE        
## 6 hclust       FALSE           Inf TRUE         TRUE            TRUE
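
Since metadata_models is a regular tibble, the constraints behind any single model name can be inspected with a quick filter (a small sketch):

# Show the constraints used when model_name = "randomForest"
metadata_models %>% filter(name == "randomForest")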

The idea is that anyone can add the most popular models, or some configuration that is not there yet.
There is some redundancy, but the purpose is to focus on the model, not on the metadata it needs.
This way we don't have to remember "no NA allowed in random forest"; we just write randomForest.

Some custom configurations:

  • no_na: variables must not contain NA.
  • num_no_na: numeric variables with no NA (useful, for example, when doing deep learning); see the sketch below.
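
For instance, checking mtcars against the num_no_na profile (a quick sketch; mtcars is all numeric with no missing values, so it should pass):

# mtcars contains only numeric columns and no NA, so this should report ok
data_integrity_model(data = mtcars, model_name = "num_no_na")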

Embedding it in a production data flow 🚚

There are typical questions that come up when interviewing candidates. I like these ones: "How do you deal with new data?" or "What do you take into account when deploying a model?"

Based on our first example:

integ_mod_1
## 
## βœ– {NA detected} num_vessels_flour, thal, gender
## βœ– {Character detected} gender, has_heart_disease
## βœ– {One unique value} constant

We can check:

integ_mod_1$data_ok
## [1] FALSE

data_ok is a flag that is useful to stop a process, raising an error if anything goes wrong.
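
For example, a scoring or retraining job could halt right here (a minimal sketch, assuming integ_mod_1 was computed on the incoming data):

# Abort the pipeline with an informative error if any check failed
if (!integ_mod_1$data_ok) {
  stop("Data integrity check failed; see the data_integrity_model() output.")
}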

More examples 🎁

Example 2:

On the mtcars data frame, check if there is any variable with NA:

di2=data_integrity_model(data = mtcars, model_name = "no_na")

# Check:
di2
## βœ” Data model integrity ok!

Good to go?

di2$data_ok
## [1] TRUE

Example 3:

data_integrity_model(data = heart_disease, model_name = "pca")
## 
## βœ– {NA detected} num_vessels_flour, thal
## βœ– {Non-numeric detected} gender, chest_pain, fasting_blood_sugar, resting_electro, thal, exter_angina, has_heart_disease

Example 4:

data_integrity_model(data = iris, model_name = "kmeans")
## 
## βœ– {Non-numeric detected} Species

Any suggestions?

If you come across any case that isn't covered here, you are welcome to contribute: funModeling's GitHub.

How about time series? I treated them as numeric with no NA (model_name = "num_no_na"). You can add any new model by updating the metadata_models table.
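
As a sketch of what such an addition could look like, here is a new row with the same columns as metadata_models shown above (the model name and values are hypothetical, and whether a locally extended copy is picked up depends on the package internals, so the safest route is a pull request on GitHub):

# Hypothetical profile: only numeric variables, no NA allowed
my_model = tibble(
  name = "my_ts_model", allow_NA = FALSE, max_unique = Inf,
  allow_factor = FALSE, allow_character = FALSE, only_numeric = TRUE
)
metadata_models_extended = bind_rows(metadata_models, my_model)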

And that's it.


In case you want to understand more about data types and quality, you can check the Data Science Live Book πŸ“—

Have data fun! πŸš€

πŸ“¬ You can find me on LinkedIn & Twitter.

Pablo Casas
