tl;dr: Convert numerical variables into categorical ones, as shown in the next image.
⏳ Reading time ~ 6 min.
funModeling (from version > 1.6.6) introduces two functions,
discretize_get_bins and discretize_df, that work together
to help us with the discretization task.
# First we load the libraries
# install.packages("funModeling")
library(funModeling)
library(dplyr)
Let's see an example. First, we check current data types:
df_status(heart_disease, print_results = F) %>%
  select(variable, type, unique, q_na) %>%
  arrange(type)

##                  variable    type unique q_na
## 1                  gender  factor      2    0
## 2              chest_pain  factor      4    0
## 3     fasting_blood_sugar  factor      2    0
## 4         resting_electro  factor      3    0
## 5                    thal  factor      3    2
## 6            exter_angina  factor      2    0
## 7       has_heart_disease  factor      2    0
## 8                     age integer     41    0
## 9  resting_blood_pressure integer     50    0
## 10      serum_cholestoral integer    152    0
## 11         max_heart_rate integer     91    0
## 12            exer_angina integer      2    0
## 13                  slope integer      3    0
## 14      num_vessels_flour integer      4    4
## 15 heart_disease_severity integer      5    0
## 16                oldpeak numeric     40    0
We've got factor, integer, and numeric variables: a good mix! The
transformation has two steps. First, it gets the cuts, i.e. the threshold
values at which each segment begins. Second, it uses those thresholds
to convert the variables into categorical ones.
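To make the two steps concrete, here is a minimal base R sketch of equal-frequency discretization. It is not funModeling's implementation: `get_cuts` and `apply_cuts` are hypothetical helper names, and the toy vector is made up for illustration.

```r
# Hypothetical helpers for illustration only; not funModeling's internals.

# Step 1: compute equal-frequency thresholds from the data.
get_cuts <- function(x, n_bins) {
  probs <- seq(0, 1, length.out = n_bins + 1)
  q <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  # Drop the observed min/max and use open-ended boundaries instead.
  inner <- q[-c(1, length(q))]
  unique(c(-Inf, inner, Inf))
}

# Step 2: use the thresholds to turn the numeric vector into a factor.
apply_cuts <- function(x, cuts) {
  cut(x, breaks = cuts, right = FALSE, include.lowest = TRUE)
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)  # toy data
cuts <- get_cuts(x, n_bins = 5)
x_cat <- apply_cuts(x, cuts)
table(x_cat)
```

Keeping the thresholds (`cuts`) as a separate object is what lets the second step be repeated later on different data.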
Two variables will be discretized in the following example:
max_heart_rate and oldpeak. Also, we'll introduce some NA values into
oldpeak to test how the function works with missing data.
# Introducing some missing values in the first 30 rows of the oldpeak variable
heart_disease$oldpeak[1:30]=NA
Step 1) Getting the bin thresholds for each input variable:
discretize_get_bins returns a data frame that needs to be used in the
discretize_df function, which returns the final processed data frame.
d_bins=discretize_get_bins(data=heart_disease, input=c("max_heart_rate", "oldpeak"), n_bins=5)

## [1] "Variables processed: max_heart_rate, oldpeak"

# Checking `d_bins` object:
d_bins

##         variable                     cuts
## 1 max_heart_rate -Inf|131|147|160|171|Inf
## 2        oldpeak   -Inf|0.1|0.3|1.1|2|Inf
data: the data frame containing the variables to be processed.
input: vector of strings containing the variable names.
n_bins: the number of bins/segments to have in the discretized data.
We can see each threshold point (or upper boundary) for each variable.
-Inf is not an actual upper boundary: more info in the
"Discretization with new data" section below.
Step 2) Applying the thresholds for each variable:
# Now it can be applied on the same data frame or in a new one
# (for example, in a predictive model that changes data over time)
heart_disease_discretized=discretize_df(data=heart_disease, data_bins=d_bins, stringsAsFactors=T)

## [1] "Variables processed: max_heart_rate, oldpeak"
data: data frame containing the numerical variables to be discretized.
data_bins: data frame returned by discretize_get_bins. If it is changed by the user, then each upper boundary must be separated by a pipe character (|) as shown in the example.
stringsAsFactors: TRUE by default; final variables will be factors (instead of characters), which is useful when plotting.
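Because the cuts are stored as a pipe-separated string, changing them by hand is plain string work. The sketch below uses a hand-built data frame that mimics the `d_bins` printed above. The column names `variable` and `cuts` are taken from that output, the custom thresholds are arbitrary, and whether the package accepts a data frame built this way is an assumption.

```r
# Hand-built stand-in for the `d_bins` object printed earlier.
d_bins_manual <- data.frame(
  variable = c("max_heart_rate", "oldpeak"),
  cuts = c("-Inf|131|147|160|171|Inf", "-Inf|0.1|0.3|1.1|2|Inf"),
  stringsAsFactors = FALSE
)

# Replace the oldpeak thresholds with custom ones, keeping the pipe separator.
new_cuts <- c(-Inf, 0.5, 1.5, Inf)
d_bins_manual$cuts[d_bins_manual$variable == "oldpeak"] <-
  paste(new_cuts, collapse = "|")

# Parse the string back into numbers to double-check the edit.
parsed <- as.numeric(strsplit(d_bins_manual$cuts[2], "|", fixed = TRUE)[[1]])
parsed
```

Parsing the edited string back into numbers is a cheap check that the thresholds are still sorted and well formed before they are applied.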
Final results and their plots
Before and after
describe(heart_disease_discretized %>% select(max_heart_rate,oldpeak))

## heart_disease_discretized %>% select(max_heart_rate, oldpeak)
##
##  2  Variables      303  Observations
## ---------------------------------------------------------------------------
## max_heart_rate
##        n  missing distinct
##      303        0        5
##
## Value      [-Inf, 131) [ 131, 147) [ 147, 160) [ 160, 171) [ 171, Inf]
## Frequency           63          59          62          62          57
## Proportion       0.208       0.195       0.205       0.205       0.188
## ---------------------------------------------------------------------------
## oldpeak
##        n  missing distinct
##      303        0        6
##
## Value      [-Inf, 0.1) [ 0.1, 0.3) [ 0.3, 1.1) [ 1.1, 2.0) [ 2.0, Inf]
## Frequency           97          18          54          54          50
## Proportion       0.320       0.059       0.178       0.178       0.165
##
## Value              NA.
## Frequency           30
## Proportion       0.099
## ---------------------------------------------------------------------------

# ggplot2 is needed for the plots below
library(ggplot2)

p5=ggplot(heart_disease_discretized, aes(max_heart_rate)) +
  geom_bar(fill="#0072B2") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

p6=ggplot(heart_disease_discretized, aes(oldpeak)) +
  geom_bar(fill="#CC79A7") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

gridExtra::grid.arrange(p5, p6, ncol=2)
Showing final variable distribution:
Sometimes, it is not possible to get the same number of cases per bucket
when computing equal frequency, as is shown in the oldpeak variable.
Regarding the NA values, the new oldpeak variable has six
categories: five categories defined in n_bins=5 plus the NA. value.
Note the point at the end indicating the presence of missing values.
- discretize_df will never return an NA value without transforming it to the string NA. (with the trailing point).
- n_bins sets the number of bins for all the variables.
- If input is missing, then it will run for all numeric/integer variables whose number of unique values is greater than the number of bins (n_bins).
- Only the variables defined in input will be processed, while the remaining variables will not be modified at all.
- discretize_get_bins returns just a data frame that can be changed by hand as needed, either in a text file or in the R session.
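The NA handling described above can be imitated in base R. This is only a sketch of the idea (turning missing values into an explicit "NA." category), not the package's own code; the toy values are made up, and the cuts are the oldpeak thresholds shown earlier.

```r
x <- c(0.0, 0.2, NA, 1.5, 2.3, NA)        # toy data with missing values
cuts <- c(-Inf, 0.1, 0.3, 1.1, 2, Inf)   # oldpeak thresholds from above

# cut() leaves missing inputs as NA ...
binned <- cut(x, breaks = cuts, right = FALSE, include.lowest = TRUE)

# ... so convert to character, label the NAs explicitly, and rebuild the
# factor with "NA." as an extra, final level.
lab <- as.character(binned)
lab[is.na(lab)] <- "NA."
x_cat <- factor(lab, levels = c(levels(binned), "NA."))

table(x_cat)
```

With an explicit level, the missing values show up as their own bar in a plot and as their own row in a frequency table instead of being silently dropped.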
Discretization with new data
In our data, the minimum value for
max_heart_rate is 71. The data
preparation must be robust with new data; e.g., if a new patient arrives
whose max_heart_rate is 68, then the current process will assign
her/him to the lowest category.
In other functions from other packages, this preparation may return an
NA because the value falls outside every segment.
As we pointed out before, if new data comes over time, it's likely to
contain new min/max values. This can break our process. To solve this,
discretize_df will always have as min/max the values -Inf/Inf;
thus, any new value falling below/above the minimum/maximum will be
added to the lowest or highest segment as applicable.
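The difference between bounded and open-ended boundaries is easy to reproduce with base R's `cut()`. This is a sketch of the idea rather than the package's code; the upper bound of 202 in the bounded breaks is an illustrative value, not taken from the article.

```r
new_patient <- 68  # below the observed minimum of 71

# Breaks bounded by an observed range (202 is an illustrative maximum).
bounded <- c(71, 131, 147, 160, 171, 202)
res_bounded <- cut(new_patient, breaks = bounded,
                   right = FALSE, include.lowest = TRUE)
res_bounded  # the value falls outside every interval, so the result is NA

# Open-ended breaks, as in the d_bins output above.
open_ended <- c(-Inf, 131, 147, 160, 171, Inf)
res_open <- cut(new_patient, breaks = open_ended,
                right = FALSE, include.lowest = TRUE)
res_open     # assigned to the lowest segment
```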
The data frame returned by
discretize_get_bins must be saved in order
to apply it to new data. If the discretization is not intended to run
with new data, then there is no point in having two functions: one
would be enough. In addition, there would be no need to save the results of
discretize_get_bins.
With this two-step approach, we can handle both cases.
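One way to save the thresholds and reuse them later is `saveRDS()`/`readRDS()`. The sketch below uses a hand-built stand-in for the data frame returned by discretize_get_bins and applies the cuts with base R, so it runs without the package. The file location and the new values are made up; in practice the reloaded object would presumably be passed as `data_bins` to discretize_df.

```r
# Stand-in for the data frame returned by discretize_get_bins.
d_bins <- data.frame(
  variable = "max_heart_rate",
  cuts = "-Inf|131|147|160|171|Inf",
  stringsAsFactors = FALSE
)

# Persist the thresholds at training time (temporary file for illustration).
bins_file <- tempfile(fileext = ".rds")
saveRDS(d_bins, bins_file)

# Later, e.g. in a fresh session: reload them and apply to new data.
d_bins_loaded <- readRDS(bins_file)
new_data <- data.frame(max_heart_rate = c(68, 150, 210))

breaks <- as.numeric(strsplit(d_bins_loaded$cuts[1], "|", fixed = TRUE)[[1]])
new_data$max_heart_rate_cat <- cut(
  new_data$max_heart_rate, breaks = breaks,
  right = FALSE, include.lowest = TRUE
)
new_data
```

The new values 68 and 210 lie outside the range seen in the original data, yet both still land in a segment because the saved boundaries are open-ended.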
Conclusions about two-step discretization
The usage of
discretize_get_bins and discretize_df provides quick data
preparation, with a clean data frame that is ready to use. It clearly
shows where each segment begins and ends, indispensable when making
The decision not to fail when dealing with a new min/max in new data
is just that: a decision. In some contexts, failure would be the desired
behavior.
The human intervention: The easiest way to discretize a data frame
is to select the same number of bins to apply to every variable, just
like in the example we saw. However, if tuning is needed, then some
variables may need a different number of bins. For example, a
variable with less dispersion can work well with a low number of bins.
Common values for the number of segments could be 3, 5, 10, or 20 (but
no more). It is up to the data scientist to make this decision.
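Since the thresholds live in an ordinary data frame, a per-variable number of bins can be sketched by building one row of cuts per variable. Everything here is hypothetical: the toy data, the `bins_per_var` choices, and the `make_cuts` helper, which uses base R quantiles rather than the package. With funModeling, a similar result could presumably be obtained by calling discretize_get_bins once per group of variables and row-binding the results, but that is an assumption.

```r
set.seed(42)
toy <- data.frame(a = rnorm(200), b = runif(200))  # made-up data

# Hypothetical tuning decision: few bins for `a`, many for `b`.
bins_per_var <- c(a = 3, b = 10)

# Hypothetical helper: equal-frequency cuts as a pipe-separated string.
make_cuts <- function(x, n_bins) {
  q <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1),
                na.rm = TRUE, names = FALSE)
  inner <- unique(q[-c(1, length(q))])
  paste(c(-Inf, inner, Inf), collapse = "|")
}

d_bins_custom <- data.frame(
  variable = names(bins_per_var),
  cuts = unname(mapply(function(v, n) make_cuts(toy[[v]], n),
                       names(bins_per_var), bins_per_var)),
  stringsAsFactors = FALSE
)
d_bins_custom
```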
Bonus track: The trade-off art ⚖️
- A high number of bins => More noise captured.
- A low number of bins => Oversimplification, less variance.
Do these terms sound similar to any other ones in machine learning?
The answer: Yes! Just to mention one example: the trade-off between
adding or subtracting variables from a predictive model.
- More variables: Overfitting alert (too detailed predictive model).
- Fewer variables: Underfitting danger (not enough information to capture general patterns).
Just as Eastern philosophy has pointed out for thousands of years, there is an art in finding the right balance between one value and its opposite.
📌 This article was adapted from the Data Science Live Book - Handling Data Types chapter: https://livebook.datascienceheroes.com/data-preparation.html#data_types. Please go there for a deeper coverage.
Keep in touch: @pabloc_ds.
~ Thanks for reading 🚀.