
# Fast data exploration for predictive modeling

The problem: before modeling, we need to check (and often change) several things in the data: numerical and categorical variables, `NA` values, variables with a single unique value, and high-cardinality variables.

Version 1.9.2 of `funModeling` was released to help with this step, which comes before creating machine learning models.

This post continues in "Automatic data types checking in predictive models".

## Introduction

The `data_integrity` function provides information about the format of all the variables, as well as some short stats about `NA` values.

This way we can select and transform the variables, keeping them in the format we need.

``````# install.packages("funModeling")
library(funModeling)
``````

``````library(tidyverse)
``````

Now we call the `data_integrity` function on our data frame `data`, which returns an `integrity` object:

``````di=data_integrity(data)
``````

Then, the `summary` function gives us a quick, self-explanatory overview:

``````summary(di)
``````
``````##
## ◌ {Numerical with NA} num_vessels_flour, thal
## ◌ {Categorical with NA} gender
## ● {One unique value} constant
``````

Now we can apply `mutate_at`, `select`, or any other function to specific columns.

If we need the variable names as a vector of strings, we can use the RStudio bare-combine add-in.

The high-cardinality threshold can be changed with the `MAX_UNIQUE` parameter, which defaults to 35.
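As a minimal sketch of how the threshold behaves, the example below builds a small made-up data frame (`toy`, `city` and `flag` are hypothetical names, not part of the post's dataset) and compares the default call with a stricter one. It assumes that a categorical variable is flagged when its number of unique values exceeds `MAX_UNIQUE`, and it reads the result from the `vars_cat_high_card` element printed in the next section.

```r
library(funModeling)

# Made-up data for illustration only
toy = data.frame(
  city = c("a", "b", "c", "d", "e"),   # 5 unique values
  flag = c("y", "n", "y", "n", "y"),   # 2 unique values
  stringsAsFactors = FALSE
)

# Default threshold: no categorical variable is flagged
di_default = data_integrity(toy)
di_default$results$vars_cat_high_card

# Stricter threshold: `city` should now be reported as high cardinality
di_strict = data_integrity(toy, MAX_UNIQUE = 3)
di_strict$results$vars_cat_high_card
```

A lower threshold is mostly useful when the model at hand cannot handle categorical variables with many levels.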

## Accessing all the information

If we print the integrity object, we can see a lot of information regarding `NA`, numerical, categorical and other types, alongside the high cardinality variables:

``````di
``````
``````## $vars_num_with_NA
##            variable q_na       p_na
## 1 num_vessels_flour    4 0.01320132
## 2              thal    2 0.00660066
##
## $vars_cat_with_NA
##   variable q_na       p_na
## 1   gender    1 0.00330033
##
## $vars_cat_high_card
## [1] variable unique
## <0 rows> (or 0-length row.names)
##
## $MAX_UNIQUE
## [1] 35
##
## $vars_one_value
## [1] "constant"
##
## $vars_cat
## [1] "gender"            "has_heart_disease"
##
## $vars_num
##  [1] "age"                    "chest_pain"             "resting_blood_pressure"
##  [4] "serum_cholestoral"      "fasting_blood_sugar"    "resting_electro"
##  [7] "max_heart_rate"         "exer_angina"            "oldpeak"
## [10] "slope"                  "num_vessels_flour"      "thal"
## [13] "heart_disease_severity" "exter_angina"           "constant"
## [16] "id"
##
## $vars_char
## [1] "gender"            "has_heart_disease"
##
## $vars_factor
## character(0)
##
## $vars_other
## [1] "has_heart_disease2" "fecha"              "fecha2"
``````

Each element can be accessed directly, so we can operate on it quickly:

``````di$results$vars_num
``````
``````##  [1] "age"                    "chest_pain"             "resting_blood_pressure"
##  [4] "serum_cholestoral"      "fasting_blood_sugar"    "resting_electro"
##  [7] "max_heart_rate"         "exer_angina"            "oldpeak"
## [10] "slope"                  "num_vessels_flour"      "thal"
## [13] "heart_disease_severity" "exter_angina"           "constant"
## [16] "id"
``````

Numerical variables with `NA` values:

``````di$results$vars_num_with_NA$variable
``````
``````## [1] "num_vessels_flour" "thal"
``````
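These name vectors plug straight into `select` and `mutate_at`. The sketch below is one possible cleaning step, not the package's prescribed workflow: it uses a small made-up data frame (`toy` is a hypothetical name), drops the columns reported in `vars_one_value`, and fills `NA` in the numeric columns reported in `vars_num_with_NA` with the column median. The median imputation is an assumption chosen only for illustration.

```r
library(funModeling)
library(dplyr)

# Made-up data for illustration only
toy = data.frame(
  age = c(34, NA, 51, 47),
  gender = c("m", "f", NA, "f"),
  constant = 1,
  stringsAsFactors = FALSE
)

di_toy = data_integrity(toy)

# Drop the columns that hold a single unique value
toy_clean = select(toy, -one_of(di_toy$results$vars_one_value))

# Numeric columns that contain NA, as a character vector
vars_na = di_toy$results$vars_num_with_NA$variable

# Replace NA with the column median in those columns only
toy_clean = mutate_at(
  toy_clean,
  vars_na,
  function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x)
)
```

Since the names come from the integrity object, the same lines keep working when columns are added or removed from the input data.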

Help page:

``````help("data_integrity")
``````

## New `status` function

This is the internal function used in `data_integrity`:

``````status(heart_disease)
``````
``````##                  variable q_zeros   p_zeros q_na       p_na q_inf p_inf    type unique
## 1                     age       0 0.0000000    0 0.00000000     0     0 integer     41
## 2                  gender       0 0.0000000    0 0.00000000     0     0  factor      2
## 3              chest_pain       0 0.0000000    0 0.00000000     0     0  factor      4
## 4  resting_blood_pressure       0 0.0000000    0 0.00000000     0     0 integer     50
## 5       serum_cholestoral       0 0.0000000    0 0.00000000     0     0 integer    152
## 6     fasting_blood_sugar     258 0.8514851    0 0.00000000     0     0  factor      2
## 7         resting_electro     151 0.4983498    0 0.00000000     0     0  factor      3
## 8          max_heart_rate       0 0.0000000    0 0.00000000     0     0 integer     91
## 9             exer_angina     204 0.6732673    0 0.00000000     0     0 integer      2
## 10                oldpeak      99 0.3267327    0 0.00000000     0     0 numeric     40
## 11                  slope       0 0.0000000    0 0.00000000     0     0 integer      3
## 12      num_vessels_flour     176 0.5808581    4 0.01320132     0     0 integer      4
## 13                   thal       0 0.0000000    2 0.00660066     0     0  factor      3
## 14 heart_disease_severity     164 0.5412541    0 0.00000000     0     0 integer      5
## 15           exter_angina     204 0.6732673    0 0.00000000     0     0  factor      2
## 16      has_heart_disease       0 0.0000000    0 0.00000000     0     0  factor      2
``````

It's another version of `df_status`, where percentages are expressed in the range of 0 to 1 (not 0 to 100), which is more intuitive to use in filters.

The integrity object stores this same status table, computed on its input data, as `di$status_now`.
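Because the proportions are in the 0 to 1 range, a filter on them reads naturally. The following is a small sketch using the `heart_disease` data shipped with the package; the 0.5 cutoff and the names `st` and `high_zeros` are arbitrary choices for the example.

```r
library(funModeling)
library(dplyr)

st = status(heart_disease)

# Names of the variables where more than half of the values are zero
high_zeros = filter(st, p_zeros > 0.5) %>% pull(variable)
high_zeros

# Rows of the status table for variables that contain any NA
filter(st, p_na > 0)
```

The resulting names can then be passed to `select` to keep or drop those columns.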

## Next release?

It will contain, based on `data_integrity`, an automated data quality test suited to the predictive model we need to run.
I found this task quite important and repetitive when I teach. Hopefully it will save some time!

All of these topics are covered in depth in the Data Science Live Book 📗.

Have fun! 🚀