21 December 2017 / rstats

Data discretization made easy with funModeling

tl;dr: Convert numerical variables into categorical ones, as shown in the next image.

⏳ Reading time ~ 6 min.

Let's start!

The funModeling package (in versions after 1.6.6) introduces two
functions, discretize_get_bins and discretize_df, that work together
to help us with the discretization task.

If you were using version 1.6.6, please see the update note below (Jan-19-2018).

    # First we load the libraries
    # install.packages("funModeling")
    library(funModeling)
    library(dplyr)

Let's see an example. First, we check current data types:

    df_status(heart_disease, print_results = FALSE) %>%
      select(variable, type, unique, q_na) %>%
      arrange(type)

    ##                  variable    type unique q_na
    ## 1                  gender  factor      2    0
    ## 2              chest_pain  factor      4    0
    ## 3     fasting_blood_sugar  factor      2    0
    ## 4         resting_electro  factor      3    0
    ## 5                    thal  factor      3    2
    ## 6            exter_angina  factor      2    0
    ## 7       has_heart_disease  factor      2    0
    ## 8                     age integer     41    0
    ## 9  resting_blood_pressure integer     50    0
    ## 10      serum_cholestoral integer    152    0
    ## 11         max_heart_rate integer     91    0
    ## 12            exer_angina integer      2    0
    ## 13                  slope integer      3    0
    ## 14      num_vessels_flour integer      4    4
    ## 15 heart_disease_severity integer      5    0
    ## 16                oldpeak numeric     40    0

We've got factor, integer, and numeric variables: a good mix! The
transformation has two steps. The first gets the cuts, or threshold
values, that delimit each segment. The second uses those thresholds
to convert the variables into categorical ones.
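
To make the two steps concrete, here is a minimal base R sketch on a toy vector. It is not funModeling's internal code: `x`, `cuts`, and `x_binned` are names made up for illustration, and `quantile()` plus `cut()` only stand in for the two package functions.

```r
# Toy data: 10 values to split into 5 equal-frequency bins (illustrative only)
x <- c(71, 96, 105, 120, 131, 147, 152, 160, 171, 202)

# Step 1: get the inner thresholds (the 20%, 40%, 60%, and 80% quantiles)
cuts <- quantile(x, probs = seq(0.2, 0.8, by = 0.2), names = FALSE)

# Step 2: apply the thresholds; -Inf and Inf close the outer segments
x_binned <- cut(x, breaks = c(-Inf, cuts, Inf), right = FALSE)

# Each of the 5 segments ends up with the same number of cases
table(x_binned)
```

Keeping the thresholds in their own object (`cuts` here) is what lets the second step be repeated later on different data.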

Two variables will be discretized in the following example:
max_heart_rate and oldpeak. Also, we'll introduce some NA values
into oldpeak to test how the function works with missing data.

    # Introducing some missing values in the first 30 rows of the oldpeak variable
    heart_disease$oldpeak[1:30]=NA

Step 1) Getting the bin thresholds for each input variable:

discretize_get_bins returns a data frame that needs to be used in the
discretize_df function, which returns the final processed data frame.

    d_bins=discretize_get_bins(data=heart_disease, input=c("max_heart_rate", "oldpeak"), n_bins=5)

    ## [1] "Variables processed: max_heart_rate, oldpeak"

    # Checking `d_bins` object:
    d_bins

    ##         variable                     cuts
    ## 1 max_heart_rate 131|147|160|171|Inf
    ## 2        oldpeak   0.1|0.3|1.1|2|Inf

Parameters:

data: the data frame containing the variables to be processed.
input: vector of strings containing the variable names.
n_bins: the number of bins/segments to have in the discretized
data.

We can see each threshold point (or upper boundary) for each variable.

Update Jan-19-2018: Some points that differ between versions 1.6.6 and 1.6.7:

discretize_get_bins no longer creates the -Inf threshold, since that value was always considered to be the minimum.
A single-value category is now represented as a range; for example, what was "5" is now "[5, 6)".
Bucket formatting may have changed; if you were using this function in production, you need to check the new values.

Time to continue with the next step!

Step 2) Applying the thresholds for each variable:

    # The thresholds can be applied to the same data frame or to a new one
    # (for example, in a predictive model whose data changes over time)
    heart_disease_discretized=discretize_df(data=heart_disease,
                                            data_bins=d_bins,
                                            stringsAsFactors=TRUE)

    ## [1] "Variables processed: max_heart_rate, oldpeak"

Parameters:

data: data frame containing the numerical variables to be
discretized.
data_bins: data frame returned by discretize_get_bins. If it is
changed by the user, then each upper boundary must be separated by a
pipe character (|) as shown in the example.
stringsAsFactors: TRUE by default; the final variables will be
factors (instead of characters), which is useful when plotting.
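
If the bins are written or edited by hand, it can help to sanity-check them first. The sketch below builds a data frame in the same two-column layout as d_bins and verifies the pipe-separated format using base R only. The threshold values and the names `custom_bins`, `parsed`, and `ok` are made up for illustration.

```r
# Hand-written bins in the same two-column layout that d_bins has above
# (the threshold values here are made up for illustration)
custom_bins <- data.frame(
  variable = c("max_heart_rate", "oldpeak"),
  cuts = c("120|150|180|Inf", "0.5|1.5|Inf"),
  stringsAsFactors = FALSE
)

# Split on the literal pipe character and convert each boundary to a number
parsed <- lapply(strsplit(custom_bins$cuts, "|", fixed = TRUE), as.numeric)
names(parsed) <- custom_bins$variable

# Every boundary must parse as a number, and the sequence must be increasing
ok <- vapply(
  parsed,
  function(b) !anyNA(b) && !is.unsorted(b, strictly = TRUE),
  logical(1)
)
ok
```

Going by the parameter description above, a data frame that passes this check should be usable as the data_bins argument.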

Final results and their plots

Before and after

Final distribution:

    describe(heart_disease_discretized %>% select(max_heart_rate,oldpeak))

    ## heart_disease_discretized %>% select(max_heart_rate, oldpeak) 
    ## 
    ##  2  Variables      303  Observations
    ## ---------------------------------------------------------------------------
    ## max_heart_rate 
    ##        n  missing distinct 
    ##      303        0        5 
    ##                                                                       
    ## Value      [-Inf, 131) [ 131, 147) [ 147, 160) [ 160, 171) [ 171, Inf]
    ## Frequency           63          59          62          62          57
    ## Proportion       0.208       0.195       0.205       0.205       0.188
    ## ---------------------------------------------------------------------------
    ## oldpeak 
    ##        n  missing distinct 
    ##      303        0        6 
    ##                                                                       
    ## Value      [-Inf, 0.1) [ 0.1, 0.3) [ 0.3, 1.1) [ 1.1, 2.0) [ 2.0, Inf]
    ## Frequency           97          18          54          54          50
    ## Proportion       0.320       0.059       0.178       0.178       0.165
    ##                       
    ## Value              NA.
    ## Frequency           30
    ## Proportion       0.099
    ## ---------------------------------------------------------------------------

    # Load ggplot2 explicitly in case it is not already attached
    library(ggplot2)

    p5=ggplot(heart_disease_discretized, aes(max_heart_rate)) +
      geom_bar(fill="#0072B2") + theme_bw() +
      theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

    p6=ggplot(heart_disease_discretized, aes(oldpeak)) +
      geom_bar(fill="#CC79A7") + theme_bw() +
      theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

    gridExtra::grid.arrange(p5, p6, ncol=2)

Showing final variable distribution:

Sometimes it is not possible to get the same number of cases per bucket
when computing equal-frequency bins, as the oldpeak variable shows.
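
One common cause of uneven buckets is ties: rows that share the same value cannot be split across two bins. The base R sketch below reproduces the effect on made-up data. It is not the real oldpeak column and not funModeling's algorithm, just `quantile()` and `cut()` on a toy vector.

```r
# Toy vector where half of the values are tied at 0 (made-up data)
x <- c(0, 0, 0, 0, 0, 0.2, 0.6, 1.0, 1.4, 2.0)

# Ask for 5 equal-frequency bins: inner thresholds at 20/40/60/80%
cuts <- quantile(x, probs = c(0.2, 0.4, 0.6, 0.8), names = FALSE)

# The first two thresholds are both 0, so duplicates are dropped
breaks <- c(-Inf, unique(cuts), Inf)

# Only 4 bins survive, and their sizes are far from equal
counts <- table(cut(x, breaks = breaks))
counts
```

The five zeros land together in the first bin, so the remaining bins have to share the other five values.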

NA handling

Regarding the NA values, the new oldpeak variable has six
categories: the five defined by n_bins=5 plus the NA. value.
Note the dot at the end, which indicates the presence of missing values.
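
The same idea can be sketched in base R: bin the non-missing values, then turn what is left into an explicit category. This only illustrates the behavior described above, with made-up data and break points. It is not how funModeling implements it.

```r
# Toy vector with missing values (made-up data)
x <- c(0.0, 0.4, NA, 1.2, 2.5, NA)

# Bin the values; cut() leaves the missing ones as NA
binned <- as.character(cut(x, breaks = c(-Inf, 0.3, 1.1, Inf), right = FALSE))

# Turn the missing values into an explicit "NA." category
binned[is.na(binned)] <- "NA."

table(binned)
```

With the missing values labeled, they show up in tables and bar plots as one more category instead of being silently dropped.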

More info

discretize_df never returns a raw NA value; it always transforms it
into the string "NA." (note the trailing dot).
n_bins sets the number of bins for all the variables.
If input is missing, then it will run for all numeric/integer
variables whose number of unique values is greater than the number
of bins (n_bins).
Only the variables defined in input will be processed while
remaining variables will not be modified at all.
discretize_get_bins returns just a data frame that can be changed
by hand as needed, either in a text file or in the R session.
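
The default selection rule (when input is missing) can be approximated in base R as follows. The data frame is made up, and excluding NA from the unique count is an assumption of this sketch rather than something the text above states.

```r
# Toy data frame (made-up columns) to illustrate the selection rule
df <- data.frame(
  age     = c(34, 51, 47, 62, 29, 58, 41),
  slope   = c(1, 2, 1, 3, 2, 1, 2),
  gender  = factor(c("f", "m", "m", "f", "m", "f", "m")),
  oldpeak = c(0.0, 1.4, NA, 2.3, 0.6, 3.1, 0.2)
)
n_bins <- 5

# Keep numeric/integer columns with more unique values than n_bins
# (assumption: missing values are not counted as a unique value)
is_candidate <- vapply(
  df,
  function(col) is.numeric(col) && length(unique(col[!is.na(col)])) > n_bins,
  logical(1)
)

names(df)[is_candidate]
```

Here slope is skipped because it has only 3 unique values, and gender is skipped because it is a factor.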

Discretization with new data

In our data, the minimum value for max_heart_rate is 71. The data
preparation must be robust with new data; e.g., if a new patient arrives
whose max_heart_rate is 68, then the current process will assign
her/him to the lowest category.

Functions from other packages may return an NA in this situation,
because the value falls outside every segment.
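
Base R's `cut()` with bounded breaks is one example of this behavior. In the sketch below, the break points are chosen for illustration, with 71 as the lowest one to match the minimum mentioned above.

```r
# Thresholds bounded on both sides (values chosen for illustration)
bounded_breaks <- c(71, 131, 147, 160, 171, 202)

# A new patient arrives with a value below the lowest break
new_value <- 68

result <- cut(new_value, breaks = bounded_breaks, include.lowest = TRUE)
result
# The result is NA: 68 does not fall inside any of the bounded segments
```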

As we pointed out before, if new data comes over time, it's likely to
get new min/max value/s. This can break our process. To solve this,
discretize_df will always have as min/max the values -Inf/Inf;
thus, any new value falling below/above the minimum/maximum will be
added to the lowest or highest segment as applicable.
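
A rough base R equivalent of this open-ended behavior, assuming the pipe-separated upper boundaries shown earlier for max_heart_rate, could look like the sketch below. The names `cuts_string`, `upper`, `breaks`, and `new_values` are illustrative, and `cut()` stands in for discretize_df.

```r
# Upper boundaries in the pipe-separated format shown in d_bins above
cuts_string <- "131|147|160|171|Inf"
upper <- as.numeric(strsplit(cuts_string, "|", fixed = TRUE)[[1]])

# Prepend -Inf so the lowest segment has no lower limit
breaks <- c(-Inf, upper)

# 68 and 250 lie outside the range of values mentioned in the text
new_values <- c(68, 150, 250)
binned <- cut(new_values, breaks = breaks, right = FALSE)
binned
```

No value comes back as NA: 68 is assigned to the lowest segment and 250 to the highest one.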

The data frame returned by discretize_get_bins must be saved in order
to apply it to new data. If the discretization is not intended to run
with new data, then there is no point in having two functions; one
would be enough. In addition, there would be no need to save the
results of discretize_get_bins.
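
One simple way to save the thresholds is `saveRDS()` / `readRDS()`. In the sketch below, `d_bins` is rebuilt by hand from the printout shown earlier so that the example runs on its own, and the file location is a temporary path chosen for illustration.

```r
# Stand-in for the object returned by discretize_get_bins
# (values copied from the d_bins printout above)
d_bins <- data.frame(
  variable = c("max_heart_rate", "oldpeak"),
  cuts = c("131|147|160|171|Inf", "0.1|0.3|1.1|2|Inf"),
  stringsAsFactors = FALSE
)

# Save the thresholds at training time
# (a temp file is used here; in practice pick a stable path)
bins_path <- file.path(tempdir(), "d_bins.rds")
saveRDS(d_bins, bins_path)

# Later, possibly in another session: load them back before scoring new data
d_bins_loaded <- readRDS(bins_path)
identical(d_bins, d_bins_loaded)
```

The loaded data frame would then be passed as data_bins to discretize_df together with the new data.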

Having this two-step approach, we can handle both cases.

Conclusions about two-step discretization

The usage of discretize_get_bins + discretize_df provides quick data
preparation, with a clean data frame that is ready to use. It clearly
shows where each segment begins and ends, which is indispensable when
making statistical reports.

The decision not to fail when dealing with a new min/max in new data
is just a design choice. In some contexts, failure would be the desired
behavior.

The human intervention: The easiest way to discretize a data frame
is to apply the same number of bins to every variable, just like in
the example we saw. However, if tuning is needed, some variables may
need a different number of bins. For example, a variable with less
dispersion can work well with a low number of bins.
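
As a sketch of per-variable tuning, the base R code below builds a bins table with a different number of bins for each column. `get_cuts` is a hypothetical helper written for this example (it is not part of funModeling), and the data is made up. The output only mimics the variable/cuts layout shown earlier.

```r
# Hypothetical helper (not part of funModeling): equal-frequency upper
# boundaries for one vector, returned in the pipe-separated layout
get_cuts <- function(x, n_bins) {
  probs <- seq_len(n_bins - 1) / n_bins
  inner <- unique(quantile(x, probs = probs, na.rm = TRUE, names = FALSE))
  paste(c(inner, Inf), collapse = "|")
}

# Toy data (made-up values): one spread-out column, one with low dispersion
df <- data.frame(
  wide   = c(5, 12, 18, 25, 31, 44, 52, 67, 73, 90),
  narrow = c(1.0, 1.1, 1.1, 1.2, 1.2, 1.3, 1.3, 1.4, 1.4, 1.5)
)

# A different number of bins per variable
bins_per_var <- c(wide = 5, narrow = 2)

custom_bins <- data.frame(
  variable = names(bins_per_var),
  cuts = unname(mapply(
    function(v, n) get_cuts(df[[v]], n),
    names(bins_per_var), bins_per_var
  )),
  stringsAsFactors = FALSE
)
custom_bins
```

The low-dispersion column gets a single inner threshold, while the spread-out one keeps five segments.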

Common values for the number of segments could be 3, 5, 10, or 20 (but
no more). It is up to the data scientist to make this decision.

Bonus track: The trade-off art ⚖️

A high number of bins => More noise captured.
A low number of bins => Oversimplification, less variance.

Do these terms sound similar to any other ones in machine learning?

The answer: Yes! Just to mention one example: the trade-off between
adding variables to or removing them from a predictive model.

More variables: Overfitting alert (too detailed predictive model).
Fewer variables: Underfitting danger (not enough information to
capture general patterns).

Just like oriental philosophy has pointed out for thousands of years, there is an art in finding the right balance between one value and its opposite.

📌 This article was adapted from the Data Science Live Book - Handling Data Types chapter: https://livebook.datascienceheroes.com/data-preparation.html#data_types. Please go there for deeper coverage.

Book's download page 📥📘

Keep in touch: @pabloc_ds.

~ Thanks for reading 🚀.