Fast data exploration for predictive modeling
The problem: before modeling, we need to check (and often change) numerical and categorical variables, NAs, variables with a single unique value, and high-cardinality variables.
funModeling 1.9.2 has been released. The new version aims to assist with the step that comes before building machine learning models.
This post is continued in Automatic data types checking in predictive models.
Introduction
The data_integrity function provides information about the format of all the variables, as well as some short stats about NA values.
This way we can select and transform the variables, keeping them in the format we need.
# install.packages("funModeling")
library(funModeling)
Load the messy data:
library(tidyverse)
data=read_delim("https://raw.githubusercontent.com/pablo14/data-integrity/master/messy_data.txt", delim = ';')
Now we call the data_integrity function, which returns an integrity object:
di=data_integrity(data)
Then, the summary function gives us a quick, self-explanatory overview:
summary(di)
##
## ◌ {Numerical with NA} num_vessels_flour, thal
## ◌ {Categorical with NA} gender
## ● {One unique value} constant
Now we can apply mutate_at, select, or any other function to specific columns.
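As a rough sketch of that idea, the names stored in the integrity object can be passed straight to dplyr verbs. This example uses the heart_disease data set bundled with funModeling instead of the messy file, and adds a hypothetical constant column only so there is something to drop.

```r
library(funModeling)
library(dplyr)

# heart_disease ships with funModeling; the extra column is an assumption
# added here so that one variable has a single unique value.
df <- heart_disease %>% mutate(constant = 1)

di <- data_integrity(df)

# Drop the variables flagged as having one unique value
df_clean <- df %>% select(-all_of(di$results$vars_one_value))

# Keep only the numerical variables that were not dropped above
num_vars <- setdiff(di$results$vars_num, di$results$vars_one_value)
df_num <- df_clean %>% select(all_of(num_vars))
```

Using all_of() makes the selection fail loudly if a name in the vector is missing from the data, which is usually what we want in a cleaning step.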
In case we need the variable names as a vector of strings, we can use the RStudio bare-combine add-in, hrbraddins::bare_combine() by @hrbrmstr (https://t.co/8dwqNEso0B). As Mara Averick (@dataandme) tweeted on July 29, 2019: "My keyboard shortcut for this lil' function gets quite the workout…"
The high-cardinality threshold (35 by default) can be changed with the MAX_UNIQUE parameter.
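A small sketch of changing that threshold, again on the bundled heart_disease data. The value 3 is arbitrary and chosen only for illustration; with a lower threshold, more categorical variables may be flagged as high cardinality.

```r
library(funModeling)

# Default threshold (MAX_UNIQUE = 35)
di_default <- data_integrity(heart_disease)

# A deliberately low threshold, for illustration only
di_strict <- data_integrity(heart_disease, MAX_UNIQUE = 3)

# Compare which categorical variables are flagged in each case
di_default$results$vars_cat_high_card
di_strict$results$vars_cat_high_card
```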
Accessing all the information
If we print the integrity object, we can see a lot of information regarding NA, numerical, categorical and other types, alongside the high cardinality variables:
di
## $vars_num_with_NA
## variable q_na p_na
## 1 num_vessels_flour 4 0.01320132
## 2 thal 2 0.00660066
##
## $vars_cat_with_NA
## variable q_na p_na
## 1 gender 1 0.00330033
##
## $vars_cat_high_card
## [1] variable unique
## <0 rows> (or 0-length row.names)
##
## $MAX_UNIQUE
## [1] 35
##
## $vars_one_value
## [1] "constant"
##
## $vars_cat
## [1] "gender" "has_heart_disease"
##
## $vars_num
## [1] "age" "chest_pain" "resting_blood_pressure"
## [4] "serum_cholestoral" "fasting_blood_sugar" "resting_electro"
## [7] "max_heart_rate" "exer_angina" "oldpeak"
## [10] "slope" "num_vessels_flour" "thal"
## [13] "heart_disease_severity" "exter_angina" "constant"
## [16] "id"
##
## $vars_char
## [1] "gender" "has_heart_disease"
##
## $vars_factor
## character(0)
##
## $vars_other
## [1] "has_heart_disease2" "fecha" "fecha2"
Each element is accessible through di$results, so we can operate on it quickly:
di$results$vars_num
## [1] "age" "chest_pain" "resting_blood_pressure"
## [4] "serum_cholestoral" "fasting_blood_sugar" "resting_electro"
## [7] "max_heart_rate" "exer_angina" "oldpeak"
## [10] "slope" "num_vessels_flour" "thal"
## [13] "heart_disease_severity" "exter_angina" "constant"
## [16] "id"
Numerical variables with NA values:
di$results$vars_num_with_NA$variable
## [1] "num_vessels_flour" "thal"
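One possible use of this vector is to treat the missing values in exactly those columns. The sketch below uses heart_disease from funModeling and median imputation; the imputation method is an assumption made for the example, not a recommendation from the package.

```r
library(funModeling)
library(dplyr)

di <- data_integrity(heart_disease)

# Names of the numerical variables that contain NA values
vars_na <- as.character(di$results$vars_num_with_NA$variable)

# Replace NA with the column median, only in those columns
heart_imputed <- heart_disease %>%
  mutate_at(vars_na, ~ ifelse(is.na(.), median(., na.rm = TRUE), .))
```

Categorical variables with NA values (di$results$vars_cat_with_NA) are left untouched here and would need their own treatment.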
Help page:
help("data_integrity")
New status function
This is the internal function used in data_integrity:
status(heart_disease)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 age 0 0.0000000 0 0.00000000 0 0 integer 41
## 2 gender 0 0.0000000 0 0.00000000 0 0 factor 2
## 3 chest_pain 0 0.0000000 0 0.00000000 0 0 factor 4
## 4 resting_blood_pressure 0 0.0000000 0 0.00000000 0 0 integer 50
## 5 serum_cholestoral 0 0.0000000 0 0.00000000 0 0 integer 152
## 6 fasting_blood_sugar 258 0.8514851 0 0.00000000 0 0 factor 2
## 7 resting_electro 151 0.4983498 0 0.00000000 0 0 factor 3
## 8 max_heart_rate 0 0.0000000 0 0.00000000 0 0 integer 91
## 9 exer_angina 204 0.6732673 0 0.00000000 0 0 integer 2
## 10 oldpeak 99 0.3267327 0 0.00000000 0 0 numeric 40
## 11 slope 0 0.0000000 0 0.00000000 0 0 integer 3
## 12 num_vessels_flour 176 0.5808581 4 0.01320132 0 0 integer 4
## 13 thal 0 0.0000000 2 0.00660066 0 0 factor 3
## 14 heart_disease_severity 164 0.5412541 0 0.00000000 0 0 integer 5
## 15 exter_angina 204 0.6732673 0 0.00000000 0 0 factor 2
## 16 has_heart_disease 0 0.0000000 0 0.00000000 0 0 factor 2
It's another version of df_status, where percentages are expressed in the range 0 to 1 (not 0 to 100), which is more intuitive to use in filters.
data_integrity stores the same kind of table, computed on its input data, in di$status_now.
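To illustrate filtering on these 0-to-1 proportions, here is a short sketch on heart_disease. The 0.5 cut-off for zeros is arbitrary and only for illustration.

```r
library(funModeling)
library(dplyr)

st <- status(heart_disease)

# Variables with any missing values (p_na is a proportion between 0 and 1)
vars_with_na <- st %>% filter(p_na > 0) %>% pull(variable)

# Variables where more than half of the values are zero
vars_many_zeros <- st %>% filter(p_zeros > 0.5) %>% pull(variable)
```

Both results are plain vectors of variable names, so they can be fed to select or mutate_at like the vectors in di$results.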
Next release?
Based on data_integrity, it will contain an automated data quality test suited to the predictive model we need to run.
I find this task quite important and repetitive when I teach. Hopefully it will save some time!
Further reading
All of these topics are covered in depth in the Data Science Live Book 📗.
Have fun! 🚀