Automatic data type checking in predictive models
The problem: we have data, and we need to create models (xgboost, random forest, regression, etc.). Each of them has its own constraints regarding data types.
Many strange errors appear when we create models simply because of the data format.
The new version of funModeling (1.9.3, Oct 2019) aims to provide quick and clean assistance on this.
Cover photo by: @franjacquier_
tl;dr; code
Based on some messy data, we want to run a random forest, so before getting some weird errors we can check the data first:
Example 1:
# install.packages("funModeling")
library(funModeling)
library(tidyverse)
# Load data
data=read_delim("https://raw.githubusercontent.com/pablo14/data-integrity/master/messy_data.txt", delim = ';')
# Call the function:
integ_mod_1=data_integrity_model(data = data, model_name = "randomForest")
# Any errors?
integ_mod_1
##
## ✖ {NA detected} num_vessels_flour, thal, gender
## ✖ {Character detected} gender, has_heart_disease
## ✖ {One unique value} constant
Apart from the "one unique value" warning, the other errors need to be solved in order to create a random forest.
Algorithms have their own data type restrictions and their own error messages, which makes execution a hard debugging task. data_integrity_model
reports all such errors with a common, consistent message.
Introduction
data_integrity_model
is built on top of data_integrity
function. We talked about it in the post: Fast data exploration for predictive modeling.
It checks:
- NA values
- Data types (allow non-numeric? allow character?)
- High cardinality
- One unique value
Supported models
It takes the metadata from a table that comes pre-loaded with funModeling:
head(metadata_models)
## # A tibble: 6 x 6
## name allow_NA max_unique allow_factor allow_character only_numeric
## <chr> <lgl> <dbl> <lgl> <lgl> <lgl>
## 1 randomForest FALSE 53 TRUE FALSE FALSE
## 2 xgboost TRUE Inf FALSE FALSE TRUE
## 3 num_no_na FALSE Inf FALSE FALSE TRUE
## 4 no_na FALSE Inf TRUE TRUE TRUE
## 5 kmeans FALSE Inf TRUE TRUE TRUE
## 6 hclust FALSE Inf TRUE TRUE TRUE
The idea is that anyone can add the most popular models, or a configuration that is not there yet.
There are some redundancies, but the purpose is to focus on the model, not on the required metadata.
This way we don't have to remember "no NA allowed" for random forest; we just write randomForest.
Some custom configurations:
- no_na: variables with no NA.
- num_no_na: numeric with no NA (useful, for example, when doing deep learning).
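For instance, num_no_na acts as a quick gate before feeding data to a numeric-only method. A minimal sketch using the heart_disease data frame that ships with funModeling (it contains NA values and non-numeric columns, so the check fails):

```r
library(funModeling)

# heart_disease has NA values (num_vessels_flour, thal) and
# character/factor columns, so the strict numeric-no-NA profile fails:
di_num = data_integrity_model(data = heart_disease, model_name = "num_no_na")
di_num$data_ok
## [1] FALSE
```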
Embed in a data flow in production
Many people ask typical questions when interviewing candidates. I like these ones: "How do you deal with new data?" or "What considerations do you have when you deploy a model?"
Based on our first example:
integ_mod_1
##
## ✖ {NA detected} num_vessels_flour, thal, gender
## ✖ {Character detected} gender, has_heart_disease
## ✖ {One unique value} constant
We can check:
integ_mod_1$data_ok
## [1] FALSE
data_ok
is a flag that can be used to stop a process, raising an error if anything goes wrong.
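For example, a minimal guard for a production pipeline could be sketched like this (check_before_training is a hypothetical helper name, not part of funModeling):

```r
library(funModeling)

# Hypothetical guard: abort the pipeline when the data does not meet
# the requirements of the chosen model.
check_before_training = function(data, model_name = "randomForest") {
  integ = data_integrity_model(data = data, model_name = model_name)
  if (!integ$data_ok) {
    print(integ)  # show which variables failed and why
    stop("Data integrity check failed; aborting model training.")
  }
  TRUE
}

# check_before_training(data)  # raises an error on the messy data above
```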
More examples
Example 2:
On the mtcars data frame, check whether there is any variable with NA values:
di2=data_integrity_model(data = mtcars, model_name = "no_na")
# Check:
di2
## ✔ Data model integrity ok!
Good to go?
di2$data_ok
## [1] TRUE
Example 3:
data_integrity_model(data = heart_disease, model_name = "pca")
##
## ✖ {NA detected} num_vessels_flour, thal
## ✖ {Non-numeric detected} gender, chest_pain, fasting_blood_sugar, resting_electro, thal, exter_angina, has_heart_disease
Example 4:
data_integrity_model(data = iris, model_name = "kmeans")
##
## ✖ {Non-numeric detected} Species
Any suggestions?
If you come across any cases that aren't covered here, you are welcome to contribute on funModeling's GitHub.
What about time series? I treat them as numeric with no NA (model_name = num_no_na
). You can add any new model by updating the metadata_models
table.
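A sketch of what adding a custom profile could look like (the glmnet_custom name is hypothetical; the column names follow the head(metadata_models) output shown above, and depending on the package version you may need to pass the modified table explicitly rather than rely on the global environment):

```r
library(funModeling)
library(tibble)

# Hypothetical profile: numeric-only input, no NA allowed.
new_row = tibble(
  name            = "glmnet_custom",
  allow_NA        = FALSE,
  max_unique      = Inf,
  allow_factor    = FALSE,
  allow_character = FALSE,
  only_numeric    = TRUE
)
metadata_models = rbind(metadata_models, new_row)

# "glmnet_custom" is now listed among the available model names:
"glmnet_custom" %in% metadata_models$name
## [1] TRUE
```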
And that's it.
In case you want to understand more about data types and quality, you can check the Data Science Live Book.
Have data fun!