Data discretization made easy with funModeling
tl;dr: Convert numerical variables into categorical ones, as shown in the example below.
⏳ Reading time ~ 6 min.
Let's start!
The funModeling package (available since version 1.6.6) introduces two functions, discretize_get_bins and discretize_df, that work together to make the discretization task easier. If you were using version 1.6.6, please see the update note below (Jan-19-2018).
# First we load the libraries
# install.packages("funModeling")
library(funModeling)
library(dplyr)
Let's see an example. First, we check current data types:
df_status(heart_disease, print_results = F) %>% select(variable, type, unique, q_na) %>% arrange(type)
## variable type unique q_na
## 1 gender factor 2 0
## 2 chest_pain factor 4 0
## 3 fasting_blood_sugar factor 2 0
## 4 resting_electro factor 3 0
## 5 thal factor 3 2
## 6 exter_angina factor 2 0
## 7 has_heart_disease factor 2 0
## 8 age integer 41 0
## 9 resting_blood_pressure integer 50 0
## 10 serum_cholestoral integer 152 0
## 11 max_heart_rate integer 91 0
## 12 exer_angina integer 2 0
## 13 slope integer 3 0
## 14 num_vessels_flour integer 4 4
## 15 heart_disease_severity integer 5 0
## 16 oldpeak numeric 40 0
We've got factor, integer, and numeric variables: a good mix! The transformation has two steps. First, it gets the cuts, or threshold values, at which each segment begins. Second, it uses those thresholds to convert the variables into categorical ones.
Two variables will be discretized in the following example: max_heart_rate and oldpeak. Also, we'll introduce some NA values into oldpeak to test how the function handles missing data.
# Introducing some missing values in the first 30 rows of the oldpeak variable
heart_disease$oldpeak[1:30]=NA
Step 1) Getting the bin thresholds for each input variable:
discretize_get_bins returns a data frame that is then passed to the discretize_df function, which returns the final processed data frame.
d_bins=discretize_get_bins(data=heart_disease, input=c("max_heart_rate", "oldpeak"), n_bins=5)
## [1] "Variables processed: max_heart_rate, oldpeak"
# Checking `d_bins` object:
d_bins
## variable cuts
## 1 max_heart_rate 131|147|160|171|Inf
## 2 oldpeak 0.1|0.3|1.1|2|Inf
Parameters:
- data: the data frame containing the variables to be processed.
- input: vector of strings containing the variable names.
- n_bins: the number of bins/segments to have in the discretized data.
We can see each threshold point (or upper boundary) for each variable.
Update Jan-19-2018: Some points that differ from version 1.6.6 to 1.6.7:
- discretize_get_bins no longer creates the -Inf threshold, since that value was always considered to be the minimum.
- A single-value category is now represented as a range; for example, what was "5" is now "[5, 6)".
- Bucket formatting may have changed; if you were using this function in production, you should check the new values.
Time to continue with the next step!
Step 2) Applying the thresholds for each variable:
# Now it can be applied on the same data frame or in a new one
# (for example, in a predictive model that changes data over time)
heart_disease_discretized=discretize_df(data=heart_disease,
data_bins=d_bins,
stringsAsFactors=T)
## [1] "Variables processed: max_heart_rate, oldpeak"
Parameters:
- data: the data frame containing the numerical variables to be discretized.
- data_bins: the data frame returned by discretize_get_bins. If it is changed by the user, then each upper boundary must be separated by a pipe character (|), as shown in the example.
- stringsAsFactors: TRUE by default; the final variables will be factors (instead of characters), which is useful when plotting.
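For instance, a minimal sketch of such hand-editing; the replacement boundaries below are invented for illustration, not derived from the data:
# Copy the bins object and overwrite the cuts for oldpeak
d_bins_custom=d_bins
# Make sure the cuts column is character before editing it
d_bins_custom$cuts=as.character(d_bins_custom$cuts)
d_bins_custom$cuts[d_bins_custom$variable=="oldpeak"]="0.5|1.5|3|Inf"
# Apply the hand-edited thresholds
heart_disease_custom=discretize_df(data=heart_disease, data_bins=d_bins_custom, stringsAsFactors=T)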
Final results and their plots
Before and after
Final distribution:
# describe() comes from the Hmisc package
library(Hmisc)
describe(heart_disease_discretized %>% select(max_heart_rate, oldpeak))
## heart_disease_discretized %>% select(max_heart_rate, oldpeak)
##
## 2 Variables 303 Observations
## ---------------------------------------------------------------------------
## max_heart_rate
## n missing distinct
## 303 0 5
##
## Value [-Inf, 131) [ 131, 147) [ 147, 160) [ 160, 171) [ 171, Inf]
## Frequency 63 59 62 62 57
## Proportion 0.208 0.195 0.205 0.205 0.188
## ---------------------------------------------------------------------------
## oldpeak
## n missing distinct
## 303 0 6
##
## Value [-Inf, 0.1) [ 0.1, 0.3) [ 0.3, 1.1) [ 1.1, 2.0) [ 2.0, Inf]
## Frequency 97 18 54 54 50
## Proportion 0.320 0.059 0.178 0.178 0.165
##
## Value NA.
## Frequency 30
## Proportion 0.099
## ---------------------------------------------------------------------------
# ggplot2 provides the plotting functions; gridExtra arranges the plots
library(ggplot2)
p5=ggplot(heart_disease_discretized, aes(max_heart_rate)) +
  geom_bar(fill="#0072B2") + theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
p6=ggplot(heart_disease_discretized, aes(oldpeak)) +
  geom_bar(fill="#CC79A7") + theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
gridExtra::grid.arrange(p5, p6, ncol=2)
Showing final variable distribution:
Sometimes it is not possible to get the same number of cases per bucket when computing equal-frequency binning, as shown in the oldpeak variable.
NA handling
Regarding the NA values, the new oldpeak variable has six categories: the five categories defined by n_bins=5 plus the NA. value. Note the point at the end, indicating the presence of missing values.
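A quick way to check this from the R session (base R functions; nothing specific to funModeling):
# The five bins plus the "NA." category
levels(heart_disease_discretized$oldpeak)
# Case counts per category; missing values already appear as "NA."
table(heart_disease_discretized$oldpeak)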
More info
- discretize_df will never return an NA value without transforming it to the string "NA.".
- n_bins sets the number of bins for all the variables.
- If input is missing, then it will run for all numeric/integer variables whose number of unique values is greater than the number of bins (n_bins); see the sketch right after this list.
- Only the variables defined in input will be processed, while the remaining variables will not be modified at all.
- discretize_get_bins returns just a data frame that can be changed by hand as needed, either in a text file or in the R session.
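As a minimal sketch of the input-missing behavior mentioned above, assuming the same heart_disease data frame:
# No `input`: all numeric/integer variables with more unique values than n_bins
d_bins_all=discretize_get_bins(data=heart_disease, n_bins=5)
d_bins_all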
Discretization with new data
In our data, the minimum value for max_heart_rate is 71. The data preparation must be robust with new data; e.g., if a new patient arrives whose max_heart_rate is 68, then the current process will assign her/him to the lowest category.
In other functions from other packages, this preparation may return an NA because the value is outside the known segments.
As we pointed out before, if new data comes in over time, it's likely to contain new minimum or maximum values. This can break our process. To solve this, discretize_df always uses -Inf and Inf as the minimum and maximum; thus, any new value falling below the minimum or above the maximum will be assigned to the lowest or highest segment, as applicable.
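A minimal sketch of this robustness, using a hypothetical new patient (the value 68 is invented for the example):
# Take one row as a stand-in for a new patient and force an out-of-range value
new_patient=heart_disease[1, ]
new_patient$max_heart_rate=68
new_patient_discretized=discretize_df(data=new_patient, data_bins=d_bins, stringsAsFactors=T)
# The out-of-range value falls into the lowest segment, "[-Inf, 131)"
new_patient_discretized$max_heart_rate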
The data frame returned by discretize_get_bins must be saved in order to apply it to new data. If the discretization is not intended to run with new data, then there is no sense in having two functions: one would suffice, and there would be no need to save the results of discretize_get_bins.
With this two-step approach, we can handle both cases.
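One way to persist the bins between sessions is plain base R serialization; a sketch, with an arbitrary file name and a hypothetical new_data frame:
# Save the thresholds once, at training time
saveRDS(d_bins, "d_bins.rds")
# ...later, in a new session, reload and apply them to fresh data
d_bins=readRDS("d_bins.rds")
# new_data_discretized=discretize_df(data=new_data, data_bins=d_bins, stringsAsFactors=T)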
Conclusions about two-step discretization
The usage of discretize_get_bins + discretize_df provides quick data preparation, with a clean data frame that is ready to use. It clearly shows where each segment begins and ends, which is indispensable when making statistical reports.
The decision not to fail when dealing with a new min/max in new data is just that: a decision. In some contexts, failing would be the desired behavior.
The human intervention: The easiest way to discretize a data frame is to select the same number of bins for every variable, just like in the example we saw. However, if tuning is needed, some variables may require a different number of bins. For example, a variable with less dispersion can work well with a low number of bins. Common values for the number of segments are 3, 5, 10, or 20 (but no more). It is up to the data scientist to make this decision; a sketch of per-variable tuning follows.
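As a sketch of such per-variable tuning, one option is to call discretize_get_bins once per variable and stack the results (the bin counts below are illustrative):
# Fewer bins for the less dispersed variable, more for the other
d_bins_hr=discretize_get_bins(data=heart_disease, input="max_heart_rate", n_bins=10)
d_bins_op=discretize_get_bins(data=heart_disease, input="oldpeak", n_bins=3)
# Stack both threshold tables into a single data_bins data frame
d_bins_mixed=rbind(d_bins_hr, d_bins_op)
heart_disease_mixed=discretize_df(data=heart_disease, data_bins=d_bins_mixed, stringsAsFactors=T)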
Bonus track: The trade-off art ⚖️
- A high number of bins => More noise captured.
- A low number of bins => Oversimplification, less variance.
Do these terms sound similar to any other ones in machine learning?
The answer: yes! To mention just one example: the trade-off between adding or removing variables from a predictive model.
- More variables: Overfitting alert (too detailed predictive model).
- Fewer variables: Underfitting danger (not enough information to
capture general patterns).
Just as Eastern philosophy has pointed out for thousands of years, there is an art to finding the right balance between one value and its opposite.
📌 This article was adapted from the Data Science Live Book - Handling Data Types chapter: https://livebook.datascienceheroes.com/data-preparation.html#data_types. Please go there for deeper coverage.
Keep in touch:
@pabloc_ds.
~ Thanks for reading 🚀.