Fast data exploration for predictive modeling

The problem: Before modeling, we need to check, and often change, the numerical and categorical variables, as well as those with NAs, a single unique value, or high cardinality.

funModeling 1.9.2 has been released. The new version is aimed at assisting with the step that comes before creating machine learning models.

Introduction

The data_integrity function provides information about the format of all the variables, as well as some short stats about NA values.

This way we can select and transform the variables, keeping them in the format we need.

# install.packages("funModeling")
library(funModeling)

Load the messy data:

library(tidyverse)
data=read_delim("https://raw.githubusercontent.com/pablo14/data-integrity/master/messy_data.txt", delim = ';')

Now we call the data_integrity function, which returns an integrity object:

di=data_integrity(data)

Then the summary function gives us a quick, self-explanatory overview:

summary(di)
## 
## ◌ {Numerical with NA} num_vessels_flour, thal
## ◌ {Categorical with NA} gender
## ● {One unique value} constant

Now we can apply mutate_at, select, or any other function to just those specific columns.
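As a minimal sketch of that idea, the snippet below feeds the names reported by data_integrity into mutate_at. The toy data frame and the median imputation are assumptions made for illustration only; they are not part of the original data or a recommended treatment.

```r
library(funModeling)
library(dplyr)

# Hypothetical toy data standing in for the messy file
toy <- data.frame(
  age    = c(63, 41, NA, 57),
  chol   = c(233, NA, 250, 198),
  gender = c("male", "female", NA, "male"),
  stringsAsFactors = FALSE
)

di_toy <- data_integrity(toy)

# Names of the numerical columns that contain NA values
num_na <- as.character(di_toy$results$vars_num_with_NA$variable)

# Replace NA with the column median (illustrative choice only)
toy_fixed <- toy %>%
  mutate_at(num_na, ~ ifelse(is.na(.), median(., na.rm = TRUE), .))
```

Only the columns listed in num_na are touched; gender keeps its NA because it is categorical.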

If we need the variable names as a vector of strings, we can use the RStudio bare-combine add-in.

The threshold above which a categorical variable counts as high cardinality can be changed with the MAX_UNIQUE parameter (35 by default, as shown in the output below).
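A small sketch of how the threshold changes the result, assuming a hypothetical data frame with one 50-level character column. The column names and values are made up for the example.

```r
library(funModeling)

# Hypothetical data: 'city' has 50 distinct values
toy <- data.frame(
  city = paste0("city_", 1:50),
  y    = seq_len(50) / 10,
  stringsAsFactors = FALSE
)

# With the default threshold (35), 'city' is flagged as high cardinality
di_default <- data_integrity(toy)
di_default$results$vars_cat_high_card

# Raising the threshold to 100 means 'city' is no longer flagged
di_loose <- data_integrity(toy, MAX_UNIQUE = 100)
di_loose$results$vars_cat_high_card
```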

Accessing all the information

If we print the integrity object, we can see a lot of information regarding NA, numerical, categorical and other types, alongside the high cardinality variables:

di
## $vars_num_with_NA
##            variable q_na       p_na
## 1 num_vessels_flour    4 0.01320132
## 2              thal    2 0.00660066
## 
## $vars_cat_with_NA
##   variable q_na       p_na
## 1   gender    1 0.00330033
## 
## $vars_cat_high_card
## [1] variable unique  
## <0 rows> (or 0-length row.names)
## 
## $MAX_UNIQUE
## [1] 35
## 
## $vars_one_value
## [1] "constant"
## 
## $vars_cat
## [1] "gender"            "has_heart_disease"
## 
## $vars_num
##  [1] "age"                    "chest_pain"             "resting_blood_pressure"
##  [4] "serum_cholestoral"      "fasting_blood_sugar"    "resting_electro"       
##  [7] "max_heart_rate"         "exer_angina"            "oldpeak"               
## [10] "slope"                  "num_vessels_flour"      "thal"                  
## [13] "heart_disease_severity" "exter_angina"           "constant"              
## [16] "id"                    
## 
## $vars_char
## [1] "gender"            "has_heart_disease"
## 
## $vars_factor
## character(0)
## 
## $vars_other
## [1] "has_heart_disease2" "fecha"              "fecha2"

Each element is accessible by name, so we can work with it directly:

di$results$vars_num
##  [1] "age"                    "chest_pain"             "resting_blood_pressure"
##  [4] "serum_cholestoral"      "fasting_blood_sugar"    "resting_electro"       
##  [7] "max_heart_rate"         "exer_angina"            "oldpeak"               
## [10] "slope"                  "num_vessels_flour"      "thal"                  
## [13] "heart_disease_severity" "exter_angina"           "constant"              
## [16] "id"
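These name vectors plug straight into select. The sketch below assumes a small made-up data frame; it keeps only the numerical columns in one case and drops the single-value columns in the other.

```r
library(funModeling)
library(dplyr)

# Hypothetical toy data with one constant column
toy <- data.frame(
  id       = 1:4,
  age      = c(63, 41, 52, 57),
  constant = c(1, 1, 1, 1),
  gender   = c("male", "female", "female", "male"),
  stringsAsFactors = FALSE
)

di_toy <- data_integrity(toy)

# Keep only the columns reported as numerical
num_only <- toy %>% select(all_of(di_toy$results$vars_num))

# Drop the columns that hold a single unique value
no_constants <- toy %>% select(-all_of(di_toy$results$vars_one_value))
```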

Numerical variables with NA values:

di$results$vars_num_with_NA$variable
## [1] "num_vessels_flour" "thal"
Both of these need to be fixed before modeling.

Help page:

help("data_integrity")

New status function

This is the internal function used in data_integrity:

status(heart_disease)
##                  variable q_zeros   p_zeros q_na       p_na q_inf p_inf    type unique
## 1                     age       0 0.0000000    0 0.00000000     0     0 integer     41
## 2                  gender       0 0.0000000    0 0.00000000     0     0  factor      2
## 3              chest_pain       0 0.0000000    0 0.00000000     0     0  factor      4
## 4  resting_blood_pressure       0 0.0000000    0 0.00000000     0     0 integer     50
## 5       serum_cholestoral       0 0.0000000    0 0.00000000     0     0 integer    152
## 6     fasting_blood_sugar     258 0.8514851    0 0.00000000     0     0  factor      2
## 7         resting_electro     151 0.4983498    0 0.00000000     0     0  factor      3
## 8          max_heart_rate       0 0.0000000    0 0.00000000     0     0 integer     91
## 9             exer_angina     204 0.6732673    0 0.00000000     0     0 integer      2
## 10                oldpeak      99 0.3267327    0 0.00000000     0     0 numeric     40
## 11                  slope       0 0.0000000    0 0.00000000     0     0 integer      3
## 12      num_vessels_flour     176 0.5808581    4 0.01320132     0     0 integer      4
## 13                   thal       0 0.0000000    2 0.00660066     0     0  factor      3
## 14 heart_disease_severity     164 0.5412541    0 0.00000000     0     0 integer      5
## 15           exter_angina     204 0.6732673    0 0.00000000     0     0  factor      2
## 16      has_heart_disease       0 0.0000000    0 0.00000000     0     0  factor      2

It's another version of df_status, where percentages are expressed in the range of 0 to 1 (not 0 to 100), which is more intuitive to use in filters.

data_integrity stores this same kind of table, computed on its input data, in di$status_now.
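A short sketch of such filters on the status table shown above. The cut-off values 0.01 and 0.6 are arbitrary and chosen only for illustration.

```r
library(funModeling)
library(dplyr)

st <- status(heart_disease)

# Variables with more than 1% missing values
high_na <- st %>% filter(p_na > 0.01) %>% pull(variable)

# Variables where more than 60% of the values are zero
mostly_zeros <- st %>% filter(p_zeros > 0.6) %>% pull(variable)
```

Because the proportions run from 0 to 1, the thresholds read directly as fractions of rows.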

Next release?

It will contain an automated data quality test, built on data_integrity and suited to the predictive model we need to run.
I find this task both important and repetitive when I teach. Hopefully it will save some time!

Further reading

All of these topics are covered in depth in the Data Science Live Book 📗.


Have fun! 🚀

📬 You can find me on LinkedIn & Twitter.

Pablo Casas


Data Analysis ~ The art of finding order in data by browsing its inner information.
