Data Science Heroes Blog

funModeling: New site, logo and version 🚀

Pablo Casas — Mon, 15 Jun 2020 15:16:51 GMT

Hi there!

{tl;dr} Website, here ✅

In case you don't know, funModeling is the package I've been developing over the last few years.

It's focused on exploratory data analysis, data preparation and the evaluation of models.

News

Yesterday I published the latest version, which fixes one of the plots in cross_plot. But that's not as fun as the announcement of its new logo!

Also... I added the coord_plot, useful when we are profiling any clustering model:

coord_plot(data=mtcars2, group_var="cluster", group_func=median, print_table=TRUE)

You can choose the summarization function (mean by default). Yeah... no more outlier biases in the mean, long live the percentiles!
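To see why the choice of summarization function matters, here is a minimal base R sketch. It does not call coord_plot; the toy data frame and its column names are made up for illustration, and it only compares the mean and the median per cluster when one value is extreme.

```r
# Toy data (hypothetical): two clusters, one extreme value in cluster "a"
toy <- data.frame(
  cluster = c("a", "a", "a", "b", "b", "b"),
  hp      = c(100, 110, 900, 200, 210, 220)
)

# Summarize each cluster with the mean and with the median
by_mean   <- aggregate(hp ~ cluster, data = toy, FUN = mean)
by_median <- aggregate(hp ~ cluster, data = toy, FUN = median)

by_mean    # cluster "a" is pulled up by the value 900
by_median  # cluster "a" stays close to its typical values
```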

Oh... and coord_plot produces, at the same time, a table with the results:

And it shows the underlying funModeling philosophy: little code, graphics and a table with results (easier to operate 🦾).

Blog posts based on `funModeling`:

Official page

funModeling official webpage
Check the vignette here.

Learn Data Science

You can learn and apply more functions using the Data Science Live Book. And buy a digital copy (name your price), here.

Speak Spanish? Want to study #ML? 👉 https://escueladedatosvivos.ai

Do you use funModeling for teaching? Please contact me, I want to know more :)

That's all for now!

Twitter | LinkedIn

Tips before migrating to a newer R version

Pablo Casas — Tue, 28 Apr 2020 13:36:51 GMT

This post is based on real events.

Several times when I installed the latest version of R, and proceeded to install all the packages I had in the previous version, I encountered problems. It also applies when updating packages after a while.

I decided to make this post after seeing the community reception to a quick post I made:

This post (also available in Spanish here) is not meant to discourage installing a new version of R. On the contrary, it warns about the "dark side" of migrating and aims to help keep our projects stable over time.

Luckily the functions change for the better, or even much better as it is the case of the tidyverse suite.

🗞 (A little announcement for those who speak Spanish 🇪🇸) Three weeks ago I created the data school EscuelaDeDatosVivos.AI, where you can find a free introductory R course for data science (which includes the tidyverse and funModeling, among others) 👉 Desembarcando en R

Projects that are not frequently executed

For example, when I ran the build of the Data Science Live Book (written 100% in R Markdown) after a migration, I saw function deprecation messages as warnings. Naturally, I have to remove those calls or switch to the new function.

I also had a case where some of the R Markdown parameters changed.

Another case 🎬

Imagine the following flow: R 4.0.0 is installed, then the latest version of all packages. Taking ggplot as an example, we go from 2.8.1 to 3.5.1.

Version 3.5.1 no longer has a function because it was deprecated, ergo the script fails. Or a function changed (example from the tidyverse: mutate_at, mutate_if), meaning that what is called the signature of the function is different, e.g. the .vars parameter.

Package installation

Well, if we migrate and don't install everything we had before, we're going to run an old script and have this problem.

Some recommend listing all the packages we have installed, and generating a script to install them.
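A minimal sketch of that idea in base R is below. The file name is only illustrative, the script is written to a temporary folder, and the generated command assumes every package is available from your default repository (packages installed from GitHub would need separate handling).

```r
# Names of all packages installed in the current library paths
pkgs <- rownames(installed.packages())

# Build a single install.packages() call and save it as a script
# ("reinstall_packages.R" is just an illustrative file name)
cmd    <- paste0("install.packages(", paste(deparse(pkgs), collapse = ""), ")")
script <- file.path(tempdir(), "reinstall_packages.R")
writeLines(cmd, script)

# Keeping the versions as well gives a record to compare against later
versions <- installed.packages()[, c("Package", "Version")]
head(versions, 3)
```

Run this in the old R version before migrating, then source the generated script from the new one.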

Another solution is to manually copy the packages from a folder of the old version of R to the new one. The packages are folders within the installation of R.

R on servers

Another case: R is installed on a server with processes running every day, the migration is done, and some of the functions change their signature. That is, the type of data a function expects may change.

This should not happen often if one updates package versions regularly. The normal flow for removing a function from an R package is to first announce it with a warning, the famous "deprecated": Mark a function as deprecated in customised R package.

If the announcement is made in version N+1 and we jump from N straight to N+2, we may miss the message and find that the function no longer exists.
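As a rough illustration of that flow, base R provides .Deprecated() to raise the warning. The two function names below are hypothetical; the sketch shows an old function that still works but tells the caller to move on.

```r
# Hypothetical replacement function
new_summary <- function(x) median(x)

# Hypothetical old function: warn first, then delegate to the new one
old_summary <- function(x) {
  .Deprecated("new_summary")
  new_summary(x)
}

# The call still returns a result, but a deprecation warning is raised;
# here it is caught and printed as a message so the script keeps running
result <- withCallingHandlers(
  old_summary(c(1, 2, 10)),
  warning = function(w) {
    message("Caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
result
```

If a project is only run once in a while, this warning is the message that is easy to miss between versions.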

So it is not advisable to upgrade packages and R?

As I said at the beginning, of course I encourage the migration.

We must be alert and test the projects we already have running.

Otherwise, we wouldn't have many of the facilities that today's languages give us thanks to their communities. This is not specific to R.

📝 Now that the tidymodels is out, here's another post that might interest you: How to use recipes package from tidymodels for one hot encoding 🛠

Some advice: Environments

Python has a very useful concept, the virtual environment. It is quick to create, and it makes each library installation local to the project folder.

Then you run pip freeze > requirements.txt, and all the libraries with their versions are saved to a text file, from which anyone can quickly recreate the environment used for development. Why and How to make a Requirements.txt

This is not so easy in R. There is packrat, but it has its complexities, for example when packages come from GitHub repos.

Augusto Hassel just told me about the renv library (also from RStudio! 👏). I quote the page:

"The renv package is a new effort to bring project-local R dependency management to your projects. The goal is for renv to be a robust, stable replacement for the Packrat package, with fewer surprises and better default behaviors."

You can see the slides from renv: Project Environments for R, by Kevin Ushey.

Docker

Augusto also told me about Docker as a solution:

"Using Docker we can encapsulate the environment needed to run our code through an instruction file called Dockerfile. This way, we'll always be running the same image, wherever we pick up the environment."

Here's a post by him (in Spanish): My First Docker Repository

Conclusions

✅ If you have R in production, have a testing environment and a production environment.

✅ Install R, your libraries, and then check that everything is running as usual.

✅ Have unit tests to automatically check that the data flow is not broken. In R, check: testthat.

✅ Update all libraries every X months, don't let too much time go by.

As a moral: solving version, installation and environment problems is also part of being a data scientist.

Moss! What did you think of the post?

Happy update!

📬 Find me at: Linkedin & Twitter.

SPAM detection using fastai ULMFiT - Part 1: Language Model

Pablo Casas — Mon, 23 Dec 2019 15:39:20 GMT

tl;dr: 👉 show me the code! 🔥 here 🔥

UPDATE Feb.21.2020 Part 2, the classification model, is here

Non-technical introduction

Imagine you are a lawyer who wants to study medicine. Although it is a huge change, the underlying idea is that you already know how to speak English, how to use semantics to create a text, and the rules of the language.

So when you jump into medicine, you don't have to learn from scratch that after the word "They" comes the word "were" (not "was").

You only learn the particularities of the domain field (medicine).

But what is ULMFiT? 📚🤖

ULMFiT stands for Universal Language Model Fine-tuning, and it is implemented in the fastai Python library.

Why is it useful? 🤔

It saves us time when creating an NLP project. Thanks to the transfer learning technique, we only need to fine-tune the network on our data. Let's say it learns the words of the domain field.

Especially handy if we don't have lots of data.

About google colab

Not new, but Google Colab is a tool that allows us to run Python notebook projects using the GPUs from Google servers. It's free!

This two-blog post series can be run in your browser, only by executing all the cells! Time to play :)

Besides running the uploaded version, you can copy the project directly to your google drive and do all the practice you want! (File -> Save a copy in drive)

Going more technical

The project is split into:

1- Create the language model
2- Create the classification model

The language model is what handles the word and semantics representations, and it can be chained to the classification model quickly.

I suggest you read Universal Language Model Fine-tuning for Text Classification, the paper by Jeremy Howard and Sebastian Ruder that introduced the method.

ULMFiT contains a network that was trained on WikiText-103, a corpus of about 103 million words taken from Wikipedia articles. So it already knows how to speak "neutral".

Source: arXiv:1801.06146v5

Part 1 (this post) covers sections (a) and (b): download the pre-trained language model and fine-tune it with our data.
Part 2 covers section (c): create the classification model.

📚 Learn more from:

Official web page: http://nlp.fast.ai/
fastai youtube lesson: https://youtu.be/vnOpEwmtFJ8?t=4511 (it starts at ULMFiT stage)

Code 💻

This blog post assumes you have some prior knowledge in deep learning. But if not, I encourage you to run all the projects and play by making little changes in the code to see what happens!

Some of the topics covered in the Google Colab are:

Pretrained model advantages (transfer learning)
ULMFiT in other languages? (other than English)
What is an embedding?
How to train a language model

📌 Run the project here 👉 google colab

UPDATE Feb.21.2020 Don't forget to check Part 2, the classification model, here

Have data fun! 🚀

📬 Find me at: Linkedin & Twitter.
Data Science Live Book 📗

How Auth0’s Data Team uses R and Python

Pablo Casas — Tue, 03 Dec 2019 16:24:24 GMT

The Data team is responsible for crunching, reporting, and serving data. The team also builds data integrations with other systems and creates machine learning and deep learning models.

With this post, we intend to share our favorite tools, which are proven to run with billions of data points.
Scaling processes in real-world scenarios is a hot topic among people who are new to data.

This post first appeared at: https://auth0.com/blog/how-the-auth0-data-team-uses-r-and-python/

R or Python?

Well... both!

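R is a GNU project, conceived as a language for statistical computing. It is similar to the S language, which was developed at Bell Laboratories, and was created by Ross Ihaka and Robert Gentleman at the University of Auckland in the 1990s.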

Python, developed in 1991 by Guido van Rossum, is a general-purpose language with a focus on code readability.

Both R and Python are highly extensible through packages.

We mainly use R for our data processes and ML projects, and Python to do the integrations and Deep Learning projects.

Our stack is R with RStudio, and Python 3 with Jupyter notebooks.

RStudio is an open-source and vast IDE capable of browsing data and objects created during the session, plots, debugging code, among many other options. It also provides an enterprise-ready solution.

Jupyter is also an open-source IDE, designed to interface with Julia, Python, and R. Today it is widely used by data scientists to share their analyses. Google also created "Colab", a Jupyter notebook environment capable of running in the Google Drive cloud.

So is R capable of running on production?

Yes.

We run several heavy data preparations and predictive models every day, every hour, and every few minutes.

How do we run R and Python tasks on production?

We use Airflow as an orchestrator, an open-source project created by Airbnb.

Airflow is an incredible and robust project which allows us to schedule processes, assign priorities, define rules, keep detailed logs, etc.

For development, we still use the form: Rscript my_awesome_script.R.

Airflow is a Python-based task scheduler that allows us to run chained processes with many complex dependencies, monitoring the current state of all of them and firing alerts to Slack if anything goes wrong. This is ideal for running import jobs to populate the Data Warehouse with fresh data every day.

Do we have a data warehouse?

Yes, and it's huge!

It's mounted on Amazon Redshift, a suitable option if scaling is a priority. Visit their website to learn more about it.

R connects directly to Amazon Redshift thanks to the rauth0 package, which uses the redshiftTools package, developed by Pablo Seibelt.

Generally, data is uploaded from R to Amazon Redshift using redshiftTools.
This data can be either plain files or from data frames created during the R session.

We use Python to import and export unstructured data since R does not have useful libraries currently to handle it.

We have experimented with JSON libraries in R but the result is much worse than using Python in this scenario. For example, using RJSONIO the dataset is automatically transformed into an R Data Frame, with little control of how the transformation is done. This is only useful for very simple JSON data structures and is very difficult to manipulate in R, compared to Python where this is much easier and more natural.

How do we deal with data preparation using R?

We have two scenarios, data preparation for data engineering, and data preparation for machine learning/AI.

One of the biggest strengths of R is the tidyverse, a set of packages developed by lots of ninja developers, some of them working at RStudio Inc. They provide a common API and a shared philosophy for working with data. We will cover an example in the next section.

The tidyverse, especially the dplyr package, contains a set of functions that make the exploratory data analysis and data preparation quite comfortable.

For certain tasks in crunching data prep and visualization, we use the funModeling package. It was the seed for an open-source book I published some time ago: Data Science Live Book.
It contains some good practices we follow related to deploying models on production, dealing with missing data, handling outliers, and more.

Does R scale?

One of the key points of dplyr is it can be run on databases, thanks to another package with a pretty similar name: dbplyr.

This way, we write R syntax (dplyr) and it is "automagically" converted to SQL syntax and it then runs on production.

There are some cases in which these conversions from R to SQL are not made automatically. For such cases, we are still able to do a mix of SQL syntax in R.

For example, following dplyr syntax:

flights %>%
group_by(month, day) %>%
summarise(delay = mean(dep_delay))

Generates:

SELECT month, day, AVG(dep_delay) AS delay
FROM nycflights13::flights
GROUP BY month, day

This way, dbplyr makes it transparent to the R user whether they are working with objects in RAM or in a remote database.

Not many people know it, but many key R packages are written in C++ (concretely, through the Rcpp package).

How do we share the results?

Mostly in Tableau. We have some integrations with Salesforce.

In addition, we have some reports deployed in Shiny, especially the ones that need complex customer interaction.
Shiny allows custom reports to be built using simple R code without having to learn JavaScript, Python or other frontend and backend languages. Through the use of a "reactive" interface, the user can input parameters that the Shiny application can use to react and redraw any reports. In contrast with tools like Tableau, Domo, PowerBI, etc., which are more "drag and drop", the programmatic nature of Shiny apps allows them to do almost anything the developer can imagine, which might be more difficult or impossible in other tools.

For ad hoc reports (HTML), we use R Markdown, which shares some functionality with Jupyter notebooks. It allows a script containing an analysis to end up as a dashboard, a PDF report, web-based reports, and also books!

Machine Learning / AI

We use both R and Python.

For Machine Learning projects, we mainly use the caret package in R. It provides a high-level interface to many machine learning algorithms, as well as to common tasks in data preparation, model evaluation, and hyperparameter tuning.

For Deep Learning, we use Python, specifically the Keras library with TensorFlow as the backend.
Keras is an API for building many of the most complex neural networks with just a few lines of code. They can easily scale by training them in the cloud, in services like AWS.

Nowadays we are also doing some experiments with the fastai library for NLP problems.

Summing up!

The open-source languages are leading the data path. R and Python have strong communities, and there are free and top-notch resources to learn.

Here we wanted to share the not-so-common approach of using R for data engineering tasks and our favorite R and Python libraries, with a focus on sharing results and explaining some of the practices we follow every day.

We think the most important stages in a data project are the data analysis and data preparation. Choosing the right approach can save a lot of time and help the project scale.

We hope this post encourages you to try some of the suggested technologies and rock your data projects!

Any questions? Leave them in the comments 📨

Automatic data types checking in predictive models

Pablo Casas — Mon, 14 Oct 2019 14:50:57 GMT

The problem: We have data, and we need to create models (xgboost, random forest, regression, etc). Each one of them has its constraints regarding data types.
Many strange errors appear when we are creating models just because of data format.

The new version of funModeling 1.9.3 (Oct 2019) aimed to provide quick and clean assistance on this.

Cover photo by: @franjacquier_

tl;dr;code 💻

Based on some messy data, we want to run a random forest, so before getting some weird errors, we can check...

Example 1:

# install.packages("funModeling")
library(funModeling)
library(tidyverse)

# Load data
data=read_delim("https://raw.githubusercontent.com/pablo14/data-integrity/master/messy_data.txt", delim = ';')

# Call the function:
integ_mod_1=data_integrity_model(data = data, model_name = "randomForest")

# Any errors?
integ_mod_1

## 
## ✖ {NA detected} num_vessels_flour, thal, gender
## ✖ {Character detected} gender, has_heart_disease
## ✖ {One unique value} constant

Regardless of the "one unique value", the other errors need to be solved in order to create a random forest.

Algorithms have their own data type restrictions and their own error messages, which makes execution a hard debugging task... data_integrity_model reports such problems with a common error message.

Introduction

data_integrity_model is built on top of data_integrity function. We talked about it in the post: Fast data exploration for predictive modeling.

It checks:

NA
Data types (allow non-numeric? allow character?)
High cardinality
One unique value
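To make those four checks concrete, here is a rough base R sketch of the same kind of inspection. It is not the funModeling implementation: check_integrity, its max_unique argument and the toy data are hypothetical names used only for illustration.

```r
# Hypothetical helper: report columns that would typically break a model
check_integrity <- function(df, max_unique = 35) {
  # Distinct non-missing values per column
  n_unique <- vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1))
  is_cat   <- vapply(df, function(x) is.character(x) || is.factor(x), logical(1))
  list(
    with_na   = names(df)[vapply(df, anyNA, logical(1))],
    character = names(df)[vapply(df, is.character, logical(1))],
    high_card = names(df)[is_cat & n_unique > max_unique],
    one_value = names(df)[n_unique <= 1]
  )
}

toy <- data.frame(
  age      = c(30, NA, 41),
  gender   = c("f", "m", "f"),
  constant = c(1, 1, 1),
  stringsAsFactors = FALSE
)

check_integrity(toy)
```

Each element of the returned list holds the names of the columns that fail one of the checks, so an empty element means that check passed.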

Supported models 🤖

It takes the metadata from a table that is pre-loaded with funModeling

head(metadata_models)

## # A tibble: 6 x 6
##   name         allow_NA max_unique allow_factor allow_character only_numeric
##   <chr>        <lgl>         <dbl> <lgl>        <lgl>           <lgl>       
## 1 randomForest FALSE            53 TRUE         FALSE           FALSE       
## 2 xgboost      TRUE            Inf FALSE        FALSE           TRUE        
## 3 num_no_na    FALSE           Inf FALSE        FALSE           TRUE        
## 4 no_na        FALSE           Inf TRUE         TRUE            TRUE        
## 5 kmeans       FALSE           Inf TRUE         TRUE            TRUE        
## 6 hclust       FALSE           Inf TRUE         TRUE            TRUE

The idea is that anyone can add the most popular models or some configuration that is not there.
There are some redundancies, but the purpose is to focus on the model, not on the metadata it needs.
This way we don't have to remember that random forest allows no NA values; we just write randomForest.

Some custom configurations:

no_na: no NA variables.
num_no_na: numeric with no NA (for example, useful when doing deep learning).

Embed in a data flow on production 🚚

Many people ask typical questions when interviewing candidates. I like these ones: "How do you deal with new data?" or "What are the considerations you have when you do a deploy?"

Based on our first example:

integ_mod_1

## 
## ✖ {NA detected} num_vessels_flour, thal, gender
## ✖ {Character detected} gender, has_heart_disease
## ✖ {One unique value} constant

We can check:

integ_mod_1$data_ok

## [1] FALSE

data_ok is a flag useful to stop a process raising an error if anything goes wrong.
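A minimal sketch of how such a flag could gate a production step follows. The integ list stands in for the object returned by data_integrity_model (only the data_ok element comes from the post), and run_step is a hypothetical function.

```r
# Stand-in for the integrity object; only data_ok is taken from the post
integ <- list(data_ok = FALSE)

# Hypothetical pipeline step that refuses to run on bad data
run_step <- function(integ) {
  if (!isTRUE(integ$data_ok)) {
    stop("Data integrity check failed; stopping before model training.")
  }
  "model trained"
}

# tryCatch is used here only so the example finishes and shows the message;
# in a scheduled job you would let the error stop the process
outcome <- tryCatch(run_step(integ), error = function(e) conditionMessage(e))
outcome
```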

More examples 🎁

Example 2:

On mtcars data frame, check if there is any variable with NA:

di2=data_integrity_model(data = mtcars, model_name = "no_na")

# Check:
di2

## ✔ Data model integrity ok!

Good to go?

di2$data_ok

## [1] TRUE

Example 3:

data_integrity_model(data = heart_disease, model_name = "pca")

## 
## ✖ {NA detected} num_vessels_flour, thal
## ✖ {Non-numeric detected} gender, chest_pain, fasting_blood_sugar, resting_electro, thal, exter_angina, has_heart_disease

Example 4:

data_integrity_model(data = iris, model_name = "kmeans")

## 
## ✖ {Non-numeric detected} Species

Any suggestions?

If you come across any cases which aren't covered here, you are welcome to contribute: funModeling's github.

How about time series? I took them as: numeric with no na (model_name = num_no_na). You can add any new model by updating the table metadata_models.

And that's it.

In case you want to understand more about data types and quality, you can check the Data Science Live Book 📗

Have data fun! 🚀

📬 You can find me at: Linkedin & Twitter.

Fast data exploration for predictive modeling

Pablo Casas — Wed, 18 Sep 2019 15:42:29 GMT

The problem: Before modeling, we need to check/change numerical, categorical, NAs, one unique value and high cardinality variables.

The new version of funModeling, 1.9.2, was released to provide assistance during the step prior to creating machine learning models.

This post continues in Automatic data types checking in predictive models.

Introduction

The data_integrity function provides information about the format of all the variables, as well as some short stats about NA values.

This way we can select and transform the variables, keeping them in the format we need.

# install.packages("funModeling")
library(funModeling)

Load the messy data:

library(tidyverse)
data=read_delim("https://raw.githubusercontent.com/pablo14/data-integrity/master/messy_data.txt", delim = ';')

Now we call the data_integrity function, which returns an integrity object:

di=data_integrity(data)

Then, the summary function gives us a quick, self-explanatory overview:

summary(di)

## 
## ◌ {Numerical with NA} num_vessels_flour, thal
## ◌ {Categorical with NA} gender
## ● {One unique value} constant

Now we can apply mutate_at, select, or any other function over specific columns.

In case we need the variable name as a vector of strings, we can use the RStudio bare-combine add-in:

My keyboard shortcut for this lil' function gets quite the workout…
📺 "hrbraddins::bare_combine()" by @hrbrmstr https://t.co/8dwqNEso0B #rstats pic.twitter.com/gyqz2mUE0Y
— Mara Averick (@dataandme) July 29, 2019

The high cardinality max value can be changed using the parameter MAX_UNIQUE

Accessing all the information

If we print the integrity object, we can see a lot of information regarding NA, numerical, categorical and other types, alongside the high cardinality variables:

di

## $vars_num_with_NA
##            variable q_na       p_na
## 1 num_vessels_flour    4 0.01320132
## 2              thal    2 0.00660066
## 
## $vars_cat_with_NA
##   variable q_na       p_na
## 1   gender    1 0.00330033
## 
## $vars_cat_high_card
## [1] variable unique  
## <0 rows> (or 0-length row.names)
## 
## $MAX_UNIQUE
## [1] 35
## 
## $vars_one_value
## [1] "constant"
## 
## $vars_cat
## [1] "gender"            "has_heart_disease"
## 
## $vars_num
##  [1] "age"                    "chest_pain"             "resting_blood_pressure"
##  [4] "serum_cholestoral"      "fasting_blood_sugar"    "resting_electro"       
##  [7] "max_heart_rate"         "exer_angina"            "oldpeak"               
## [10] "slope"                  "num_vessels_flour"      "thal"                  
## [13] "heart_disease_severity" "exter_angina"           "constant"              
## [16] "id"                    
## 
## $vars_char
## [1] "gender"            "has_heart_disease"
## 
## $vars_factor
## character(0)
## 
## $vars_other
## [1] "has_heart_disease2" "fecha"              "fecha2"

And each object is accessible to operate quickly:

di$results$vars_num

##  [1] "age"                    "chest_pain"             "resting_blood_pressure"
##  [4] "serum_cholestoral"      "fasting_blood_sugar"    "resting_electro"       
##  [7] "max_heart_rate"         "exer_angina"            "oldpeak"               
## [10] "slope"                  "num_vessels_flour"      "thal"                  
## [13] "heart_disease_severity" "exter_angina"           "constant"              
## [16] "id"

Numerical variables with NA values:

di$results$vars_num_with_NA$variable

## [1] "num_vessels_flour" "thal"

Help page:

help("data_integrity")

New `status` function

This is the internal function used in data_integrity:

status(heart_disease)

##                  variable q_zeros   p_zeros q_na       p_na q_inf p_inf    type unique
## 1                     age       0 0.0000000    0 0.00000000     0     0 integer     41
## 2                  gender       0 0.0000000    0 0.00000000     0     0  factor      2
## 3              chest_pain       0 0.0000000    0 0.00000000     0     0  factor      4
## 4  resting_blood_pressure       0 0.0000000    0 0.00000000     0     0 integer     50
## 5       serum_cholestoral       0 0.0000000    0 0.00000000     0     0 integer    152
## 6     fasting_blood_sugar     258 0.8514851    0 0.00000000     0     0  factor      2
## 7         resting_electro     151 0.4983498    0 0.00000000     0     0  factor      3
## 8          max_heart_rate       0 0.0000000    0 0.00000000     0     0 integer     91
## 9             exer_angina     204 0.6732673    0 0.00000000     0     0 integer      2
## 10                oldpeak      99 0.3267327    0 0.00000000     0     0 numeric     40
## 11                  slope       0 0.0000000    0 0.00000000     0     0 integer      3
## 12      num_vessels_flour     176 0.5808581    4 0.01320132     0     0 integer      4
## 13                   thal       0 0.0000000    2 0.00660066     0     0  factor      3
## 14 heart_disease_severity     164 0.5412541    0 0.00000000     0     0 integer      5
## 15           exter_angina     204 0.6732673    0 0.00000000     0     0  factor      2
## 16      has_heart_disease       0 0.0000000    0 0.00000000     0     0  factor      2

It's another version of df_status, where percentages are expressed in the range of 0 to 1 (not 0 to 100). This is more intuitive to use in filters.

This is the same object as di$status_now.
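As a small illustration of why the 0 to 1 scale is handy in filters, the base R sketch below builds a status-like table by hand. The toy data and the 0.25 threshold are made up; only the p_na column name mirrors the output above.

```r
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(0, 0, 0, 5),
  c = c("x", "y", NA, NA),
  stringsAsFactors = FALSE
)

# Proportion of NA per column, on a 0-1 scale
st <- data.frame(
  variable = names(toy),
  p_na     = unname(colMeans(is.na(toy))),
  stringsAsFactors = FALSE
)

# Keep only the variables with at most 25% missing values
keep <- st$variable[st$p_na <= 0.25]
keep
```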

Next release?

It will contain, based on data_integrity, an automated data quality test suited for the predictive model we need to run.
I found this task quite important and repetitive when I teach. Hopefully it will save some time!

How to use `recipes` package from `tidymodels` for one hot encoding 🛠

Pablo Casas — Mon, 08 Jul 2019 16:52:43 GMT

Since one of the best ways to learn is to explain, I want to share with you this quick introduction to the recipes package, from the tidymodels family.
It can help us automate some data preparation tasks.

The overview is:

How to create a recipe
How to add a step
How to do the prep
Getting the data with juice!
Apply the prep to new data
What is the difference between bake and juice?
Dealing with new values in recipes (step_novel)

Since I'm new to this package, if you have something to add, just put it in the comments ;)

Introduction

If you are new to R or you do a one-time analysis, you might not see the main advantage of this, which is (in my opinion) to have most of the data preparation steps in one place. This way it is easier to split between dev and prod.

Dev: The stage in which we create the model
Prod: The moment in which we run the model with new data

The other big advantage is it follows the tidy philosophy, so many things will be familiar.

How to use `recipes` for one hot encoding

It is focused on one hot encoding, but many other functions like scaling, applying PCA and others can be performed.

But first, what is one hot encoding?

It's a data preparation technique to convert all the categorical variables into numerical, by assigning a value of 1 when the row belongs to the category. If the variable has 100 unique values, the final result will contain 100 columns.

That's why it is a good practice to reduce the cardinality of the variable before continuing. Learn more about it in the High Cardinality Variable in Predictive Modeling chapter from the Data Science Live Book 📗.
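Before moving to recipes, the definition of one hot encoding given above can be checked quickly in base R with model.matrix. This is only a sketch to show the idea, not what recipes does internally.

```r
# One column per category of Species; "- 1" removes the intercept
# so that every level gets its own 0/1 column
one_hot <- model.matrix(~ Species - 1, data = iris)

colnames(one_hot)
head(one_hot, 3)

# Each row belongs to exactly one category, so every row sums to 1
all(rowSums(one_hot) == 1)
```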

Let's start the example with recipes!

1st - How to create a `recipe`

library(recipes)
library(tidyverse)

set.seed(3.1415)
iris_tr=sample_frac(iris, size = 0.7)

rec = recipe( ~ ., data = iris_tr)

rec

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##  predictor          5

summary(rec)

## # A tibble: 5 x 4
##   variable     type    role      source  
##   <chr>        <chr>   <chr>     <chr>   
## 1 Sepal.Length numeric predictor original
## 2 Sepal.Width  numeric predictor original
## 3 Petal.Length numeric predictor original
## 4 Petal.Width  numeric predictor original
## 5 Species      nominal predictor original

The formula ~ ., specifies that all the variables are predictors (with no outcomes).

Please note now we have two different data types, numeric and nominal (not factor nor character).

2nd - How to add a step

Now we add the step to create the dummy variables, or the one hot encoding, which can be seen as the same.

When we do the one hot encoding (one_hot = T), all the levels will be present in the final result. Conversely, when we create the dummy variables, we could have all of the levels, or one less (to avoid the multicollinearity issue).

rec_2 = rec %>% step_dummy(Species, one_hot = T)

rec_2

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##  predictor          5
## 
## Operations:
## 
## Dummy variables from Species

Now we see the dummy step.
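To see the "one less" dummy variant mentioned in this step, here is a base R sketch with model.matrix. It uses R's default treatment contrasts rather than recipes, so the column names differ from the ones recipes produces.

```r
# Default treatment contrasts: k - 1 columns, the first level is the reference.
# [, -1] drops the intercept column that model.matrix adds.
dummy <- model.matrix(~ Species, data = iris)[, -1]

colnames(dummy)

# Rows of the reference level (setosa) are encoded as all zeros
unique(dummy[iris$Species == "setosa", ])
```

With three species we get two columns, and setosa is identified by both being zero.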

3rd - How to do the `prep`

prep is like putting all the ingredients together, but we haven't cooked yet!

It generates the metadata to do the data preparation.

As we can see here:

# Apply the recipe, which has 1 step, to the data
d_prep=rec_2 %>% prep(training = iris_tr, retain = T)

d_prep

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##  predictor          5
## 
## Training data contained 105 data points and no missing data.
## 
## Operations:
## 
## Dummy variables from Species [trained]

Note we are in the "training" or dev stage. That's why we see the parameter training.

We will see retain = T in the next step.

Checking:

summary(d_prep)

## # A tibble: 7 x 4
##   variable           type    role      source  
##   <chr>              <chr>   <chr>     <chr>   
## 1 Sepal.Length       numeric predictor original
## 2 Sepal.Width        numeric predictor original
## 3 Petal.Length       numeric predictor original
## 4 Petal.Width        numeric predictor original
## 5 Species_setosa     numeric predictor derived 
## 6 Species_versicolor numeric predictor derived 
## 7 Species_virginica  numeric predictor derived

Voilà! 🎉 We have the 3 new derived columns (one hot), and it removed the original Species.

4th - Getting the data with `juice`!

Using juice function:

d2=juice(d_prep)

head(d2)

## # A tibble: 6 x 7
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species_setosa
##          <dbl>       <dbl>        <dbl>       <dbl>          <dbl>
## 1          5           3            1.6         0.2              1
## 2          6.9         3.2          5.7         2.3              0
## 3          6.3         3.3          4.7         1.6              0
## 4          5.3         3.7          1.5         0.2              1
## 5          6.3         2.3          4.4         1.3              0
## 6          6.7         3            5.2         2.3              0
## # … with 2 more variables: Species_versicolor <dbl>,
## #   Species_virginica <dbl>

juice worked because we retained the training data in the 3rd step (retain = T). Otherwise it would have returned:

⚠️ Error: Use retain = TRUE in prep to be able to extract the training set

5th - Apply the prep to new data

Now imagine we have new data as follows:

iris_new=sample_n(iris, size = 5) # taking 5 random rows

d_baked=bake(d_prep, new_data = iris_new)

d_baked

## # A tibble: 5 x 7
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species_setosa
##          <dbl>       <dbl>        <dbl>       <dbl>          <dbl>
## 1          6.4         3.2          4.5         1.5              0
## 2          4.6         3.4          1.4         0.3              1
## 3          5.2         2.7          3.9         1.4              0
## 4          4.8         3.4          1.6         0.2              1
## 5          4.8         3            1.4         0.3              1
## # … with 2 more variables: Species_versicolor <dbl>,
## #   Species_virginica <dbl>

It worked!

bake receives the prep object (d_prep) and applies it to new_data (iris_new).

What is the difference between `bake` and `juice`?

From this perspective, given the training data, the following data frames are the same:

d_tr_1=bake(d_prep, new_data = iris_tr)
d_tr_2=d2=juice(d_prep) # with retain=T

identical(d_tr_1, d_tr_2)

## [1] TRUE

Dealing with new values in recipes

Simulate a new value:

new_row=iris[1,] %>% mutate(Species=as.character(Species))
new_row[1, "Species"]="i will break your code"

new_row

##   Sepal.Length Sepal.Width Petal.Length Petal.Width                Species
## 1          5.1         3.5          1.4         0.2 i will break your code

We use bake to convert the new data set:

d2_b=bake(d_prep, new_data = new_row)

## Warning: There are new levels in a factor: i will break your code

The solution! Use `step_novel`

(Thanks to Max Kuhn)

When we do the prep, we have to add step_novel. So any new value will be assigned to the _new category.

We will start right from the beginning:

rec_2_bis = recipe( ~ ., data = iris_tr) %>% 
  step_novel(Species) %>% 
  step_dummy(Species, one_hot = T)

prep_bis = prep(rec_2_bis, training = iris_tr)

Get the final data and check it:

processed = bake(prep_bis, iris_tr)

funModeling::df_status(processed)

##             variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
## 1       Sepal.Length       0    0.00    0    0     0     0 numeric     32
## 2        Sepal.Width       0    0.00    0    0     0     0 numeric     23
## 3       Petal.Length       0    0.00    0    0     0     0 numeric     42
## 4        Petal.Width       0    0.00    0    0     0     0 numeric     20
## 5     Species_setosa      68   64.76    0    0     0     0 numeric      2
## 6 Species_versicolor      69   65.71    0    0     0     0 numeric      2
## 7  Species_virginica      73   69.52    0    0     0     0 numeric      2
## 8        Species_new     105  100.00    0    0     0     0 numeric      1

Please note that Species_new has been automatically created (with zeros).

👉 This ensures it runs well once in production.

Now let's see what happens when we have the new value:

new_row_2=bake(prep_bis, new_data = new_row)

new_row_2 %>% select(Species_new)

## # A tibble: 1 x 1
##   Species_new
##         <dbl>
## 1           1

It works!
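The idea behind step_novel can be sketched in base R, without the recipes package. This is only an illustration of the concept, not how recipes implements it: the helper name assign_novel and the label "new" are assumptions made for this example.

```r
# Hypothetical helper (not part of recipes): send unseen values to a "new" level.
assign_novel <- function(x, known_levels, novel_label = "new") {
  x <- as.character(x)
  # Missing values are left untouched; only unseen, non-missing values are relabelled
  x[!is.na(x) & !(x %in% known_levels)] <- novel_label
  factor(x, levels = c(known_levels, novel_label))
}

train_levels <- levels(iris$Species)
incoming <- c("setosa", "i will break your code", "virginica")

fixed <- assign_novel(incoming, train_levels)
fixed
table(fixed)
```

Because the extra level exists from the training stage onwards, the dummy column for it is always present, which is why the production data keeps the same shape.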

Conclusions 💡

The recipes package seems to be a good way to standardize certain data preparation tasks.
Probably one of the strongest points in R, alongside the dplyr package.

📌 Take care of the data pipeline, it is what interviewers will ask you for.

In the Data Science Live Book 📗 (open-source), I tried to cover, with simple and reproducible examples, many of the situations that come up when working in production environments.

Have fun 🚀

📬 You can find me on: Linkedin & Twitter.

References:

Basic recipes example
Modeling with parsnip and tidymodels by Benjamin Sorensen.
Creating and Preprocessing a Design Matrix with Recipes (video)

Playing with dimensions: from Clustering, PCA, t-SNE... to Carl Sagan!

Pablo Casas — Mon, 03 Jun 2019 13:30:13 GMT

👉 Update! 7/4/20 The new version of this post, with improvements and comments on UMAP, is here: https://escueladedatosvivos.ai/blog/204650/jugando-con-las-dimensiones-clustering-pca-tsne-carl-sagan

Playing with dimensions

Hi! This post is an experiment that combines the result of t-SNE with two well-known clustering techniques: k-means and hierarchical. This will be the practical section, in R.

But this post will also explore the intersection of concepts such as dimension reduction, clustering analysis, data preparation, PCA, HDBSCAN, k-NN, SOM, deep learning... and Carl Sagan!

PCA and t-SNE

For those who don't know the t-SNE technique (official site), it is a projection (or dimension reduction) technique, similar in some respects to Principal Component Analysis (PCA), used to visualize, for example, N variables in 2.

When the t-SNE output is poor, Laurens van der Maaten (the author of t-SNE) says:

As a sanity check, try running PCA on your data to reduce it to two dimensions. If this also gives bad results, then maybe there is no good structure in your data in the first place. If PCA works well but t-SNE doesn't, I am fairly sure you did something wrong.

In my experience, doing PCA with dozens of variables with:

Some extreme values
Skewed distributions
Several dummy or one-hot variables (0 or 1),

does not lead to good visualizations.

Look at this example comparing the two methods:

Source: Clustering in 2 dimensions using tsne

Makes sense, doesn't it?

Surfing higher dimensions 🏄

Since one of the t-SNE results is a two-dimensional matrix, where each point represents an input case, we can apply clustering and then group the cases according to their distance in this 2-dimensional map. Just like a geographic map, which maps 3 dimensions (our world) into two (paper).

t-SNE puts similar cases together, handling non-linearities in the data very well. After using the algorithm on several data sets, I believe that in some cases it creates something like circular shapes, like islands, where the cases are similar.

However, I didn't see this effect in the interactive demo from the Google Brain team: How to Use t-SNE Effectively. Perhaps because of the nature of the input data, 2 variables as input.

The swiss roll data

According to its FAQ, t-SNE doesn't work very well with the swiss roll toy data. However, it is a stunning example of how a three-dimensional surface (or manifold) with a distinct spiral shape unfolds like paper thanks to a dimension reduction technique.

The image was taken from this paper, where they used the "manifold sculpting" technique.

Now the practice in R!

t-SNE helps make the clustering more accurate because it converts the data into a 2-dimensional space where the points lie in circular shapes (which in turn suits k-means well, and is one of its weak points when creating segments). More on this: K-means clustering is not a free lunch.

It works as if it were a data preparation step before applying the clustering models.


library(caret)
library(Rtsne)

######################################################################
## The WHOLE post is in: https://github.com/pablo14/post_cluster_tsne
######################################################################

## Download data from: https://github.com/pablo14/post_cluster_tsne/blob/master/data_1.txt (url path inside the gitrepo.)
data_tsne=read.delim("data_1.txt", header = T, stringsAsFactors = F, sep = "\t")

## Rtsne function may take some minutes to complete...
set.seed(9)
tsne_model_1 = Rtsne(as.matrix(data_tsne), check_duplicates=FALSE, pca=TRUE, perplexity=30, theta=0.5, dims=2)

## getting the two dimension matrix
d_tsne_1 = as.data.frame(tsne_model_1$Y)

Different runs of Rtsne lead to different results, so you will most likely not see exactly the same model as the one presented here.

According to the official documentation, perplexity is related to the importance of neighbors:

"It is comparable with the number of nearest neighbors k that is employed in many manifold learners."
"Typical values for the perplexity range between 5 and 50."

The object tsne_model_1$Y contains the X-Y coordinates (variables V1 and V2) for each input case.

Plotting the t-SNE results:

## plotting the results without clustering
ggplot(d_tsne_1, aes(x=V1, y=V2)) +
  geom_point(size=0.25) +
  guides(colour=guide_legend(override.aes=list(size=6))) +
  xlab("") + ylab("") +
  ggtitle("t-SNE") +
  theme_light(base_size=20) +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank()) +
  scale_colour_brewer(palette = "Set2")

And there are the famous "islands" 🏝️. At this point, we could do some clustering just by looking at it... But let's try k-means and hierarchical clustering instead 😄. The t-SNE FAQ page suggests decreasing the perplexity parameter to avoid this; however, I didn't find any problem with this result.

Creating the cluster models

The next piece of code creates the k-means and hierarchical cluster models, and then assigns to each input case the cluster number (1, 2 or 3) it belongs to.

## keeping original data
d_tsne_1_original=d_tsne_1

## Creating k-means clustering model, and assigning the result to the data used to create the tsne
fit_cluster_kmeans=kmeans(scale(d_tsne_1), 3)
d_tsne_1_original$cl_kmeans = factor(fit_cluster_kmeans$cluster)

## Creating hierarchical cluster model, and assigning the result to the data used to create the tsne
fit_cluster_hierarchical=hclust(dist(scale(d_tsne_1)))

## setting 3 clusters as output
d_tsne_1_original$cl_hierarchical = factor(cutree(fit_cluster_hierarchical, k=3))
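
If you don't have data_1.txt at hand, the same two models can be tried on a synthetic 2D map. The following is a minimal sketch under assumptions: toy_map is made-up data (three Gaussian blobs) standing in for the t-SNE output, and nstart = 10 is an extra setting not used in the code above.

```r
set.seed(9)

# Synthetic stand-in for the t-SNE map: three well-separated 2D blobs
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
toy_map <- as.data.frame(do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centers[i, 1]), rnorm(50, centers[i, 2]))
})))
names(toy_map) <- c("V1", "V2")

# Same two models as in the post, on the scaled coordinates
fit_km <- kmeans(scale(toy_map), centers = 3, nstart = 10)
fit_hc <- hclust(dist(scale(toy_map)))

cl_km <- factor(fit_km$cluster)
cl_hc <- factor(cutree(fit_hc, k = 3))

# Cluster ids are arbitrary labels, so compare the pattern, not the numbers
table(kmeans = cl_km, hierarchical = cl_hc)
```

The cross-table is a quick way to see where both models agree before plotting anything.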

Plotting the cluster models onto the t-SNE output

Now it's time to plot the result of each cluster model, based on the t-SNE map.

plot_cluster=function(data, var_cluster, palette)
{
  ggplot(data, aes_string(x="V1", y="V2", color=var_cluster)) +
  geom_point(size=0.25) +
  guides(colour=guide_legend(override.aes=list(size=6))) +
  xlab("") + ylab("") +
  ggtitle("") +
  theme_light(base_size=20) +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        legend.direction = "horizontal", 
        legend.position = "bottom",
        legend.box = "horizontal") + 
    scale_colour_brewer(palette = palette) 
}


plot_k=plot_cluster(d_tsne_1_original, "cl_kmeans", "Accent")
plot_h=plot_cluster(d_tsne_1_original, "cl_hierarchical", "Set1")

## and finally: putting the plots side by side with gridExtra lib...
library(gridExtra)
grid.arrange(plot_k, plot_h,  ncol=2)

Visual analysis

In this case, and based only on visual analysis, hierarchical seems to make more common sense than k-means. Look at the following image:

Note: the dotted lines separating the clusters were drawn by hand.

In k-means, the points in the bottom-left corner are quite close to each other compared with the distance between other points inside the same cluster. Yet they belong to different clusters. Illustrating it:

So we have: the red arrow is shorter than the blue one...

Note: Different runs may lead to different groupings; if you don't see this effect in that part of the map, look for it in another.

This effect doesn't happen in hierarchical clustering. The clusters from this model look more even. But what do you think?

Biasing the analysis (cheating)

It's not fair to k-means to be compared like this. The last analysis is based on the idea of density clustering. This technique is really cool for overcoming the pitfalls of simpler methods.

The HDBSCAN algorithm bases its process on densities.

Find the essence of each one by looking at this picture:

Surely you understood the difference between them...

The last image comes from: Comparing Python Clustering Algorithms. Yes, Python, but it's the same for R. The package is largeVis. (Note: install it with install_github("elbamos/largeVis", ref = "release/0.2").)

Deep learning and t-SNE

Quoting Luke Metz from a great post (Visualizing with t-SNE):

Recently there has been a lot of hype around the term "deep learning". In most applications, these "deep" models can be boiled down to the composition of simple functions that embed from one high-dimensional space to another. At first glance, these spaces might seem too large to think about or visualize, but techniques such as t-SNE allow us to start to understand what's going on inside the black box. Now, instead of treating these models as black boxes, we can start to visualize and understand them.

A deep comment 👏.

Final thoughts 🚀

Beyond this post, t-SNE has proven to be a general-purpose tool for reducing dimensionality. It can be used to explore the relationships inside the data by building clusters, or to analyze anomaly cases by inspecting the isolated points in the map.

Playing with dimensions is a key concept in data science and machine learning. The perplexity parameter is really similar to the k in the nearest neighbors algorithm (k-NN). Mapping data into 2 dimensions and then doing clustering? Hmmm, that's not new, my friend: Self-Organising Maps for Customer Segmentation.

When we select the best variables to build a model, we are reducing the dimension of the data. When we build a model, we are creating a function that describes the relationships in the data... and so on...

Did you know the general concepts about k-NN and PCA? Well, this is one more step, just plug the cables in the brain and that's it. Learning general concepts gives us the chance to make this kind of association between all of these techniques. Beyond comparing programming languages, the power, in my opinion, is in focusing on how data behaves, and on how these techniques are and can be connected.

Explore your imagination with this Carl Sagan video: Flatland and the 4th Dimension. A tale about the interaction of 3D objects in a 2D plane...

📌 Keep learning about machine learning!

📗 Libro Vivo de Ciencia de Datos (open-source), fully available online!

redshiftTools v1.0.0 - CRAN Release!

Pablo Seibelt — Fri, 17 May 2019 15:00:00 GMT

A new version of the redshiftTools package has arrived with improvements, and it's now available on CRAN! This package lets you efficiently upload data into an Amazon Redshift database using the approach recommended by Amazon.

This package is helpful because uploading data to Redshift with plain inserts is very slow. The way of doing replaces and upserts recommended by the Redshift documentation consists of generating several CSV files, uploading them to an S3 bucket and then calling a copy command on the Redshift server. All of that is handled by the package.

To install this package, use the following command:

install.packages('redshiftTools')

After installing, you'll have these functions to use, which are explained in full detail in the package's man pages.

rs_create_statement: Generates the SQL statement to create a table based on the structure of a data.frame. It allows you to specify sort key, dist key and if you want to allow compression to be added or not.

rs_replace_table: Deletes all records in a table, then uploads the provided data frame into it. It runs as a transaction so the table is never empty to the other users.

rs_upsert_table: Deletes all records matching the provided keys from the uploaded dataset, and then inserts the rows from the dataset. If no keys are provided, it acts as a regular insert.

rs_cols_upsert_table: Like rs_upsert_table but can choose only some columns to update

rs_append_table: Like the previous functions but only appends data without altering existing data.

rs_create_table: This just runs rs_create_statement and then rs_replace_table, creating a table with the same structure as your data frame and then uploading the data frame to it.
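
To make the upsert semantics described above concrete, here is a small local sketch on plain data frames. It does not use redshiftTools or touch a database: upsert_local is a hypothetical helper written only for this illustration, and the data is made up.

```r
# Local illustration only: plain data frames, no Redshift connection involved.
existing <- data.frame(id = c(1, 2, 3), value = c("a", "b", "c"),
                       stringsAsFactors = FALSE)
incoming <- data.frame(id = c(2, 4), value = c("B", "d"),
                       stringsAsFactors = FALSE)

# Hypothetical helper: delete rows whose key matches the incoming data,
# then insert the incoming rows.
upsert_local <- function(target, new_rows, keys) {
  make_key <- function(df) {
    do.call(paste, c(unname(as.list(df[keys])), sep = "\r"))
  }
  kept <- target[!(make_key(target) %in% make_key(new_rows)), , drop = FALSE]
  out <- rbind(kept, new_rows)
  rownames(out) <- NULL
  out
}

result <- upsert_local(existing, incoming, keys = "id")
result
```

Row 2 is replaced by the incoming version and row 4 is appended, while rows 1 and 3 stay as they were.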

For more details, read the official README in https://github.com/sicarul/redshiftTools

A special thanks to all the collaborators that sent contributions to the package:

Future Plans

For future versions, I plan to include additional utility functions that allow you to obtain table metadata, optimize table encoding, check table permissions, etc. If you have some cool functionality to share, please send a pull request!

Launch! Libro Vivo de Ciencia de Datos 📗 (open-source)

Pablo Casas — Thu, 04 Apr 2019 16:51:23 GMT

The Spanish version of the Data Science Live Book is finally available! The book opens up, without language barriers, to Spanish speakers eager to learn 👨‍🎓👩‍🎓.

This publication is an edition of the English version, revised both in grammar and in technical aspects. You can access the complete online version at:

👉 LibroVivoDeCienciaDeDatos.ai 🚀

The Data Science Live Book, together with two articles on how to self-publish a book using bookdown, received an award from RStudio in the 1st Bookdown Contest.

Why publish in Spanish if it's already in English?

I remember that when I started studying Data Science (or Data Mining, as it was called back then), it was much harder for me to understand the technical concepts while translating at the same time, or looking words up in a dictionary.

While reading English is necessary to be in this data world, this version seeks to bring data science closer to people who are not yet comfortable reading in another language.

Luckily, there are more and more resources in Spanish to learn from. I hope the book encourages people to keep writing in other languages.

Open-source book

The Spanish version is still open-source; here is its Github repository in case you want to make suggestions or you detect bugs that I silently wrote.

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Will there be a printed paper version?

Yes, in the coming weeks, like the English version.

How do I download the book?

If you like it and want to support the project (as well as help cover some publication costs), you can download it under the "name a fair price" philosophy, with a floor of US$ 5:

Book download page 📥.

Thanks to those who collaborated with a suggestion: Alain Rodriguez, Andrew White, Chip Oglesby, Federico Molina, Federico Otero, Jonas Ertel, Lucas Crespo, Pablo Seibelt, Stuart Hertzog, Holger K. von Jouanne-Diedrich, Bernardo Lares, Kevin Hammond, Sebastian Varela, Damian Covalski.

Thanks for reading, and I hope you find the book useful! 🚀

Stay in touch on Twitter and Linkedin.

A gentle introduction to SHAP values in R

Pablo Casas — Mon, 18 Mar 2019 15:23:34 GMT

Hi there! During the first meetup of argentinaR.org -an R user group- Daniel Quelali introduced us to a new model validation technique called SHAP values.

This novel approach allows us to dig a little deeper into the complexity of the predictive model's results, while also letting us explore the relationships between variables for each predicted case.

I've been using it with "real" data, cross-validating the results, and let me tell you: it works.
This post is a gentle introduction to it, hope you enjoy it!

Find me on Twitter and Linkedin.

Clone this github repository to reproduce the plots.

Introduction

Complex predictive models are not easy to interpret. By complex I mean: random forest, xgboost, deep learning, etc.

In other words, given a certain prediction, like having a likelihood of buying= 90%, what was the influence of each input variable in order to get that score?

A recent technique to interpret black-box models has stood out among others: SHAP (SHapley Additive exPlanations) developed by Scott M. Lundberg.

Imagine a sales score model. A customer living in zip code "A1" with "10 purchases" arrives and their score is 95%, while another customer from zip code "A2" with "7 purchases" has a score of 60%.

Each variable had its contribution to the final score. Maybe a slight change in the number of purchases changes the score a lot, while changing the zip code only contributes a tiny amount on that specific customer.

SHAP measures the impact of variables taking into account the interaction with other variables.

Shapley values calculate the importance of a feature by comparing what a model predicts with and without the feature. However, since the order in which a model sees features can affect its predictions, this is done in every possible order, so that the features are fairly compared.

Source
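
The "every possible order" idea in the quote can be sketched for a toy case in base R. This is not how SHAP libraries compute values in practice (they use much faster approximations or tree-specific algorithms); value_fn, its payoffs and the feature names are made up for the illustration.

```r
# Toy "model": the payoff of a coalition of features (hypothetical numbers)
value_fn <- function(features) {
  v <- 0
  if ("purchases" %in% features) v <- v + 30
  if ("zip_code" %in% features) v <- v + 10
  # interaction: both together add an extra 20
  if (all(c("purchases", "zip_code") %in% features)) v <- v + 20
  v
}

# All orderings of a vector (only feasible for a handful of features)
permutations <- function(x) {
  if (length(x) <= 1) return(list(x))
  out <- list()
  for (i in seq_along(x)) {
    for (p in permutations(x[-i])) out[[length(out) + 1]] <- c(x[i], p)
  }
  out
}

# Average each feature's marginal contribution over every ordering
shapley <- function(features, value_fn) {
  perms <- permutations(features)
  contrib <- setNames(numeric(length(features)), features)
  for (p in perms) {
    for (k in seq_along(p)) {
      before <- p[seq_len(k - 1)]
      contrib[p[k]] <- contrib[p[k]] +
        value_fn(c(before, p[k])) - value_fn(before)
    }
  }
  contrib / length(perms)
}

phi <- shapley(c("purchases", "zip_code", "age"), value_fn)
phi       # purchases = 40, zip_code = 20, age = 0
sum(phi)  # 60: the payoff with all features minus the payoff with none
```

Note how the interaction bonus of 20 is split evenly between the two features that produce it, and how a feature that never changes the payoff gets exactly 0.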

SHAP values in data

If the original data has 200 rows and 10 variables, the shap value table will have the same dimension (200 x 10).

The original values from the input data are replaced by its SHAP values. However it is not the same replacement for all the columns. Maybe a value of 10 purchases is replaced by the value 0.3 in customer 1, but in customer 2 it is replaced by 0.6. This change is due to how the variable for that customer interacts with other variables. Variables work in groups and describe a whole.

Shap values can be obtained by doing:

shap_values=predict(xgboost_model, input_data, predcontrib = TRUE, approxcontrib = F)
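
To get a feel for that table without training xgboost, here is a sketch with a linear model on mtcars. It relies on an assumption: for a linear model with independent features, the SHAP value of a feature is its coefficient times the distance of the value from the feature's mean. This is not what xgboost computes internally, and the dataset and variables are chosen only for illustration.

```r
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
x <- mtcars[, c("wt", "hp", "qsec")]
beta <- coef(fit)[names(x)]

# One SHAP value per cell: coefficient * (value - column mean)
centered <- scale(x, center = TRUE, scale = FALSE)
shap_linear <- as.data.frame(sweep(centered, 2, beta, `*`))

dim(shap_linear)  # same dimensions as the input table
head(shap_linear)

# Additivity: baseline + row sum of SHAP values reproduces each prediction
baseline <- mean(predict(fit))
additive_ok <- all.equal(unname(baseline + rowSums(shap_linear)),
                         unname(predict(fit)))
additive_ok
```

The same value of a variable always gets the same SHAP value here because a linear model has no interactions; in a tree model it can change from row to row, as described above. Also note that the matrix returned by xgboost with predcontrib = TRUE carries one extra column, BIAS, which plays the role of baseline.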

Example in R

After creating an xgboost model, we can plot the shap summary for a rental bike dataset. The target variable is the count of rents for that particular day.

Function plot.shap.summary (from the github repo) gives us:

How to interpret the shap summary plot?

The y-axis indicates the variable name, in order of importance from top to bottom. The value next to them is the mean SHAP value.
The x-axis shows the SHAP value, which indicates the size of the change in log-odds. From this number we can extract the probability of success.
Gradient color indicates the original value for that variable. For booleans it takes two colors, but for numeric variables it can span the whole spectrum.
Each point represents a row from the original dataset.
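
When the model is a binary classifier and the SHAP values are in log-odds, the conversion to a probability can be sketched as follows. The baseline and the per-variable values are made-up numbers for a single row.

```r
# Hypothetical numbers for one row of a binary classifier
baseline_log_odds <- -1.0
shap_row <- c(purchases = 1.8, zip_code = 0.2, age = -0.5)

# SHAP values are additive on the log-odds scale
log_odds <- baseline_log_odds + sum(shap_row)

# Logistic function: 1 / (1 + exp(-log_odds))
probability <- plogis(log_odds)
probability
```

For a numeric target such as the daily bike rental count, the values would instead be read in the units of the target, so no conversion is needed.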

Going back to the bike dataset, most of the variables are boolean.

We can see that having a high humidity is associated with high and negative values on the target. Where high comes from the color and negative from the x value.

In other words, people rent fewer bikes if humidity is high.

When season.WINTER is high (or true) then shap value is high. People rent more bikes in winter, this is nice since it sounds counter-intuitive. Note the point dispersion in season.WINTER is less than in hum.

Doing a simple violin plot for variable season confirms the pattern:

As expected, rainy, snowy or stormy days are associated with less renting. However, if the value is 0, it doesn't affect much the bike renting. Look at the yellow points around the 0 value. We can check the original variable and see the difference:

What conclusion can you draw by looking at variables weekday.SAT and weekday.MON?

Shap summary from xgboost package

Function xgb.plot.shap from xgboost package provides these plots:

y-axis: shap value.
x-axis: original variable value.

Each blue dot is a row (a day in this case).

Looking at temp variable, we can see how lower temperatures are associated with a big decrease in shap values. Interesting to note that around the value 22-23 the curve starts to decrease again. A perfect non-linear relationship.

Taking mnth.SEP we can observe that dispersion around 0 is almost 0, while on the other hand, the value 1 is associated mainly with a shap increase around 200, but it also has certain days where it can push the shap value to more than 400.

mnth.SEP is a good case of interaction with other variables, since in presence of the same value (1), the shap value can differ a lot. What are the effects with other variables that explain this variance in the output? A topic for another post.

R packages with SHAP

Interpretable Machine Learning by Christoph Molnar.

shapper

A Python wrapper:

xgboostExplainer

Although it's not SHAP, the idea is really similar. It calculates the contribution for each value in every case by accessing the tree structure used in the model.

Recommended literature about SHAP values 📚

There is a vast literature around this technique; check the online book Interpretable Machine Learning by Christoph Molnar. It nicely addresses Model-Agnostic Methods and one of their particular cases, Shapley values. An outstanding work.

From classical variable-ranking approaches like weight and gain to shap values: Interpretable Machine Learning with XGBoost by Scott Lundberg.

A permutation perspective with examples: One Feature Attribution Method to (Supposedly) Rule Them All: Shapley Values.

If you have any questions, leave them below :)

Thanks for reading! 🚀

New discretization method: Recursive information gain ratio maximization

Pablo Casas — Wed, 13 Feb 2019 20:10:24 GMT

Hello everyone, I'm happy to share a new method to discretize variables I was working on for the last few months:

Recursive discretization using gain ratio for multi-class variable

tl;dr: funModeling::discretize_rgr(input, target)

The problem: We need to convert a numeric variable into a categorical one, considering the relationship with the target variable.

How do we choose the split points for each segment? The selection can improve or worsen the relationship.

Example

# Available from version 1.7 (2019-02-13), please update it before proceeding:
# install.packages("funModeling") 
library(funModeling)
library(dplyr)

heart_disease$oldpeak_2 = discretize_rgr(input=heart_disease$oldpeak, target=heart_disease$has_heart_disease)

Check the results:

Before and after the transformation

head(select(heart_disease, oldpeak, oldpeak_2))

##   oldpeak oldpeak_2
## 1     2.3 [1.9,6.2]
## 2     1.5 [1.4,1.9)
## 3     2.6 [1.9,6.2]
## 4     3.5 [1.9,6.2]
## 5     1.4 [1.4,1.9)
## 6     0.8 [0.6,1.0)

Checking the distribution

summary(heart_disease$oldpeak_2)

## [0.0,0.6) [0.6,1.0) [1.0,1.4) [1.4,1.9) [1.9,6.2] 
##       135        31        34        39        64

Plotting

cross_plot(heart_disease, input = "oldpeak_2", target = "has_heart_disease")

Left: accuracy, right: representativeness (sample size).

More info about cross_plot here.

Parameters

min_perc_bins: Controls the minimum sample size per bin, 0.1 or 10% as default.
max_n_bins: Maximum number of bins to split the input variable, 5 bins as default.

Both parameters are related, in the sense that setting a higher number in min_perc_bins may not satisfy the number of desired bins (max_n_bins).
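
The relationship between both parameters comes down to simple arithmetic, sketched below. The helper max_feasible_bins is hypothetical (it is not part of funModeling) and only gives an upper bound; the algorithm may still return fewer bins.

```r
# Each bin must hold at least min_perc_bins of the rows, so at most
# floor(1 / min_perc_bins) bins can exist, whatever max_n_bins asks for.
max_feasible_bins <- function(min_perc_bins, max_n_bins) {
  # the small epsilon guards against floating point error in the division
  min(max_n_bins, floor(1 / min_perc_bins + 1e-9))
}

bins_default <- max_feasible_bins(min_perc_bins = 0.1, max_n_bins = 5)
bins_strict <- max_feasible_bins(min_perc_bins = 0.3, max_n_bins = 5)

bins_default  # 5: the defaults are compatible
bins_strict   # 3: asking for 30% per bin leaves room for only 3 bins
```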

Little benchmark

Next image shows ROC metrics for two models, one with the original variable and another with the discretized variable. In this case, the discretization improves ROC value, but decreases the specificity.

Other scenarios

Case 1: Missing values in numeric variables.

In this case, the way we discretize a variable weighs more heavily. One data preparation trick is to convert it to categorical, where one category is "NA" and the remaining categories are the bins calculated by the algorithm. funModeling supports this scenario for equal frequency discretization, and will do the same for discretize_rgr.

Case 2: Exploratory data analysis

From the discretization, we can semantically describe the relationship between the input and the target variable. Finding the segments that maximize the likelihood might be quite helpful to report in our job or research.

About the method

It keeps a minimum sample size per segment (representativity), thanks to min_perc_bins
It uses the gain ratio metric to calculate the best split point that maximizes the target variable likelihood (accuracy).

The control of minimum sample size helps to avoid bias in segments with low representativity.

Gain ratio is an improvement over information gain, commonly used in decision trees, since it penalizes variables with high cardinality (like zip code).

The method finds the best cut point based on a list of possible candidates. Each candidate is calculated based on the percentiles. Once it finds a point that maximizes the gain ratio while also satisfying the minimum sample size condition, it creates two search branches, considering all the rows on each side of the cut point (the left and the right branches).

Now again, for each branch the algorithm finds the best point for that subset of rows, and the process repeats recursively until the stopping criteria are satisfied.
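
A single step of that search can be sketched in base R. This is not the funModeling implementation: the function names, the percentile grid (every 5%) and the simulated data are assumptions made for the example, and the recursion over both branches is left out.

```r
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Gain ratio of splitting `target` by a logical `left` indicator
gain_ratio <- function(left, target) {
  w <- mean(left)
  info_gain <- entropy(target) -
    (w * entropy(target[left]) + (1 - w) * entropy(target[!left]))
  split_info <- entropy(left)
  if (split_info == 0) return(0)
  info_gain / split_info
}

# Try percentile-based candidates and keep the one with the highest gain ratio
best_cut <- function(input, target, min_perc_bins = 0.1) {
  candidates <- unique(quantile(input, probs = seq(0.05, 0.95, by = 0.05)))
  best <- list(cut_point = NA_real_, gain_ratio = -Inf)
  for (cp in candidates) {
    left <- input < cp
    # skip cuts leaving either side below the minimum sample size
    if (min(mean(left), mean(!left)) < min_perc_bins) next
    gr <- gain_ratio(left, target)
    if (gr > best$gain_ratio) best <- list(cut_point = cp, gain_ratio = gr)
  }
  best
}

set.seed(1)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 3))
y <- rep(c("no", "yes"), each = 100)

res <- best_cut(x, y)
res
```

The full method would then call the same search on the rows to the left and to the right of res$cut_point until a stopping criterion, such as the maximum number of bins, is met.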

Learn more

The Data Science Live Book covers some points related to this method:

Discretizing numerical variables.
Sample size and accuracy trade-off, in the case of treating high-cardinality variables.

Want to grasp more about the information theory world? A Simple Guide to Entropy-Based Discretization by Kevin Meurer.

Leave any questions in the comments ;)

Thanks for reading 🚀

Find me on Twitter and Linkedin.

Want to learn more? 📗 Data Science Live Book

Feature Selection using Genetic Algorithms in R

Pablo Casas — Tue, 15 Jan 2019 14:10:05 GMT

This is a post about feature selection using genetic algorithms in R, in which we will do a quick review about:

What are genetic algorithms?
GA in ML?
What does a solution look like?
GA process and its operators
The fitness function
Genetics Algorithms in R!
Try it yourself
Relating concepts

Animation source: "Flexible Muscle-Based Locomotion for Bipedal Creatures" - Thomas Geijtenbeek

The intuition behind

Imagine a black box which can help us to decide over an unlimited number of possibilities, with a criterion such that we can find an acceptable solution (both in time and quality) to a problem that we formulate.

What are genetic algorithms?

Genetic Algorithms (GA) are a mathematical model inspired by Charles Darwin's famous idea of natural selection.

Natural selection preserves only the fittest individuals over the different generations.

Imagine a population of 100 rabbits in 1900. If we look at the population today, we are going to find rabbits that are faster and more skillful at finding food than their ancestors.

GA in ML

In machine learning, one of the uses of genetic algorithms is to pick up the right number of variables in order to create a predictive model.

Picking the right subset of variables is a combinatorial optimization problem.

The advantage of this technique over others is that it allows the best solution to emerge from the best of prior solutions: an evolutionary algorithm that improves the selection over time.

The idea of GA is to combine the different solutions generation after generation to extract the best genes (variables) from each one. That way it creates new and fitter individuals.

We can find other uses of GA, such as hyperparameter tuning, finding the maximum (or minimum) of a function, or searching for a suitable neural network architecture (neuroevolution), among others.

GA in feature selection

Every possible solution of the GA, that is, a set of selected variables (a single 🐇), is considered as a whole; the GA does not rank variables individually against the target.

And this is important because we already know that variables work in groups.

What does a solution look like?

Keeping it simple for the example, imagine we have a total of 6 variables,

One solution can be picking up 3 variables, let's say: var2, var4 and var5.

Another solution can be: var1 and var5.

These solutions are the so-called individuals or chromosomes in a population. They are possible solutions to our problem.

Credit image: Vijini Mallawaarachchi

From the image, solution 3 can be expressed as a one-hot vector: c(1,0,1,0,1,1). Each 1 indicates that the solution contains that variable. In this case: var1, var3, var5, var6.

While the solution 4 is: c(1,1,0,1,1,0).

Each position in the vector is a gene.
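To make the encoding concrete, here is a minimal base R sketch that decodes the two vectors above back into variable names. The names var1 to var6 are just the placeholders used in this example.

```r
# Placeholder variable names for the 6-variable example
col_names <- paste0("var", 1:6)

# Chromosomes taken from the text: 1 = variable selected, 0 = not selected
solution_3 <- c(1, 0, 1, 0, 1, 1)
solution_4 <- c(1, 1, 0, 1, 1, 0)

# Decode each chromosome into the variables it contains
vars_solution_3 <- col_names[solution_3 == 1]
vars_solution_4 <- col_names[solution_4 == 1]

vars_solution_3
vars_solution_4
```

The same indexing trick is what recovers the selected variable names from the best solution at the end of a GA run.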

GA process and its operators

The underlying idea of a GA is to generate some random possible solutions (called population), which represent different variables, to then combine the best solutions in an iterative process.

This combination follows the basic GA operations, which are: selection, mutation and cross-over.

Selection: Pick the fittest individuals in a generation (e.g., the solutions providing the highest ROC).
Cross-over: Create 2 new individuals, based on the genes of two solutions. These children will appear in the next generation.
Mutation: Change a gene randomly in the individual (e.g., flip a 0 to 1).
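The three operators can be sketched in a few lines of base R. This is only an illustration under my own assumptions: the function names (toy_fitness, select_fittest, uniform_crossover, mutate) are hypothetical and are not part of the GA package, and the toy fitness simply rewards var2, var4 and var5.

```r
set.seed(84211)

n_vars <- 6
pop_size <- 8

# Random initial population: one row per individual, one column per gene
population <- matrix(rbinom(pop_size * n_vars, size = 1, prob = 0.5),
                     nrow = pop_size, ncol = n_vars)

# Toy fitness (assumption): reward var2, var4 and var5,
# and penalize every selected variable a little
toy_fitness <- function(chromosome) {
  sum(chromosome[c(2, 4, 5)]) - 0.1 * sum(chromosome)
}

# Selection: return the row indices of the n_keep fittest individuals
select_fittest <- function(population, fitness_fun, n_keep) {
  scores <- apply(population, 1, fitness_fun)
  order(scores, decreasing = TRUE)[seq_len(n_keep)]
}

# Uniform cross-over: for each gene, child 1 takes it from one parent
# (chosen at random) and child 2 takes it from the other parent
uniform_crossover <- function(parent_a, parent_b) {
  from_a <- runif(length(parent_a)) < 0.5
  rbind(ifelse(from_a, parent_a, parent_b),
        ifelse(from_a, parent_b, parent_a))
}

# Mutation: flip each gene with a small probability
mutate <- function(chromosome, p_mutation = 0.03) {
  flip <- runif(length(chromosome)) < p_mutation
  ifelse(flip, 1 - chromosome, chromosome)
}

best_idx  <- select_fittest(population, toy_fitness, n_keep = 2)
offspring <- uniform_crossover(population[best_idx[1], ], population[best_idx[2], ])
mutated   <- t(apply(offspring, 1, mutate))
```

A real GA repeats these steps for many generations and keeps a few of the best individuals untouched (elitism); the GA package takes care of that loop.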

The idea is that with each generation we find better individuals, like a faster rabbit.

I recommend the post of Vijini Mallawaarachchi about how a genetic algorithm works.

These basic operations allow the algorithm to change the possible solutions by combining them in a way that maximizes the objective.

The fitness function

The objective is, for example, to keep the solution that maximizes the area under the ROC curve. This is defined in the fitness function.

The fitness function takes a possible solution (or chromosome, if you want to sound more sophisticated), and somehow evaluates the effectiveness of the selection.

Normally, the fitness function takes the binary vector c(1,1,0,0,0,0), creates, for example, a random forest model with var1 and var2, and returns the fitness value (ROC).

The fitness value calculated in this code is: ROC value / number of variables. By doing this, the algorithm penalizes solutions with a large number of variables. It is similar to the idea behind the Akaike information criterion, or AIC.
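To show the shape of such a function, here is a simplified sketch. It is not the custom_fitness function from the repository: as assumptions, it swaps the random forest for a base R logistic regression, computes the AUC in-sample with the rank formula, and uses mtcars as stand-in data. The names auc_rank and sketch_fitness are hypothetical.

```r
# AUC via the rank (Mann-Whitney) formula.
# y is a two-level factor; its second level is treated as the positive class.
auc_rank <- function(score, y) {
  is_pos <- y == levels(y)[2]
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(score)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Fitness = ROC value / number of variables (the penalty described above)
sketch_fitness <- function(vars, data_x, data_y) {
  n_vars <- sum(vars)
  if (n_vars == 0) return(0)  # an empty selection cannot be modeled

  selected <- data_x[, vars == 1, drop = FALSE]
  model_data <- cbind(selected, target = data_y)

  # Logistic regression stands in for the random forest (assumption)
  fit <- suppressWarnings(glm(target ~ ., data = model_data, family = binomial))
  score <- predict(fit, type = "response")

  auc_rank(score, data_y) / n_vars
}

# Stand-in data: predict the transmission type in mtcars
data_x <- mtcars[, c("mpg", "wt", "hp", "qsec")]
data_y <- factor(mtcars$am, labels = c("automatic", "manual"))

fitness_2_vars <- sketch_fitness(c(1, 1, 0, 0), data_x, data_y)
fitness_none   <- sketch_fitness(c(0, 0, 0, 0), data_x, data_y)
```

Since the AUC is at most 1, a solution with 2 variables can score at most 0.5 and one with 4 variables at most 0.25, which is how the division keeps the selections small.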

Genetic Algorithms in R! 🐛

My intention is to provide you with clean code so you can understand what's behind it, while at the same time letting you try new approaches like modifying the fitness function. This is a crucial point.

To use it on your own data set, make sure data_x (data frame) and data_y (factor) are compatible with the custom_fitness function.

The main library is GA, developed by Luca Scrucca. See here the vignette with examples.

📣 Important: The following code is incomplete. Clone the repository to run the example.

library(GA)

# data_x: input data frame
# data_y: target variable (factor)
# custom_fitness(): defined in the repository

# GA parameters
param_nBits=ncol(data_x)
col_names=colnames(data_x)

# Executing the GA 
ga_GA_1 = ga(fitness = function(vars) custom_fitness(vars = vars, 
                                                     data_x =  data_x, 
                                                     data_y = data_y, 
                                                     p_sampling = 0.7), # custom fitness function
             type = "binary", # optimization data type
             crossover=gabin_uCrossover,  # cross-over method (uniform)
             elitism = 3, # best N individuals to pass to the next iteration
             pmutation = 0.03, # mutation rate prob
             popSize = 50, # the number of individuals/solutions
             nBits = param_nBits, # total number of variables
             names=col_names, # variable names
             run=5, # max iter without improvement (stopping criteria)
             maxiter = 50, # max number of generations
             monitor=plot, # plot the result at each iteration
             keepBest = TRUE, # save the best solution of each iteration
             parallel = TRUE, # allow parallel processing
             seed=84211 # for reproducibility purposes
)

# Checking the results
summary(ga_GA_1)

── Genetic Algorithm ─────────────────── 

GA settings: 
Type                  =  binary 
Population size       =  50 
Number of generations =  50 
Elitism               =  3 
Crossover probability =  0.8 
Mutation probability  =  0.03 

GA results: 
Iterations             = 17 
Fitness function value = 0.2477393 
Solution = 
     radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean
[1,]           0            1              0         0               0                1
     concavity_mean concave points_mean symmetry_mean fractal_dimension_mean  ... 
[1,]              0                   0             0                      0      
     symmetry_worst fractal_dimension_worst
[1,]              0                       0

# Following line will return the variable names of the final and best solution
best_vars_ga=col_names[ga_GA_1@solution[1,]==1]

# Checking the variables of the best solution...
best_vars_ga

[1] "texture_mean"     "compactness_mean" "area_worst"       "concavity_worst"

Blue dot: Population fitness average
Green dot: Best fitness value

Note: Don't expect the result that fast 😅

Now we calculate the accuracy based on the best selection!

get_accuracy_metric(data_tr_sample = data_x, target = data_y, best_vars_ga)

[1] 0.9508279

The accuracy is around 95.08%, while the ROC value is close to 0.99 (ROC = fitness value * number of variables = 0.2477 * 4; check the fitness function).

Analyzing the results

I don't like to analyze the accuracy without the cutpoint (Scoring Data), but it's useful to compare with the results of this Kaggle post.

Its author got a similar accuracy using recursive feature elimination, or RFE, based on 5 variables, while our solution keeps 4.

Try it yourself

Try a new fitness function. Some solutions still keep a large number of variables, so you can try dividing the ROC value by the squared number of variables.
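As a quick illustration of how much stronger a squared penalty is, here is a small sketch. The function name penalized_fitness, its power argument and the ROC value of 0.95 are made up for the example.

```r
# Fitness with an adjustable penalty on the number of variables:
# power = 1 is ROC / n_vars, power = 2 is ROC / n_vars^2
penalized_fitness <- function(roc, n_vars, power = 1) {
  if (n_vars == 0) return(0)
  roc / n_vars^power
}

# Same (illustrative) ROC value, 4 vs 8 variables
linear_4  <- penalized_fitness(0.95, 4)
linear_8  <- penalized_fitness(0.95, 8)
squared_4 <- penalized_fitness(0.95, 4, power = 2)
squared_8 <- penalized_fitness(0.95, 8, power = 2)

# How many times better the smaller solution looks under each penalty
linear_4 / linear_8
squared_4 / squared_8
```

With the linear penalty, doubling the number of variables halves the fitness; with the squared penalty it divides it by four, so the GA is pushed harder towards small selections.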

Another thing to try is changing the algorithm used to get the ROC value, or even changing the metric.

Some configurations take a long time. Balance classes before modeling and play with the p_sampling parameter. Sampling techniques can have a big impact on models. Check the Sample size and class balance on model performance post for more info.

How about changing the rate of mutation or elitism? Or trying other cross-over methods?

Increase the popSize to test more possible solutions at the same time (at a time cost).

Feel free to share any insights or ideas to improve the selection.

Clone the repository to run the example.

Relating concepts

There is a parallel between GA and Deep Learning: the concept of iteration and improvement over time is similar.

I added the p_sampling parameter to speed things up, and it usually accomplishes its goal. It is similar to the batch concept used in Deep Learning. Another parallel is between the GA parameter run and the early stopping criterion in neural network training.
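The sampling idea can be sketched as follows. This is my guess at the general approach, not the code from the repository: the helper name sample_rows is hypothetical, and mtcars is used only as stand-in data.

```r
# Keep a random fraction (p_sampling) of the rows, so each fitness
# evaluation trains on less data and runs faster
sample_rows <- function(data_x, data_y, p_sampling = 0.7) {
  n_rows <- nrow(data_x)
  idx <- sample(n_rows, size = floor(n_rows * p_sampling))
  list(data_x = data_x[idx, , drop = FALSE], data_y = data_y[idx])
}

set.seed(84211)
batch <- sample_rows(mtcars[, c("mpg", "wt")], factor(mtcars$am), p_sampling = 0.7)

nrow(batch$data_x)  # 22 of the 32 rows
```

Smaller values of p_sampling make each generation cheaper, at the cost of a noisier fitness value.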

But the biggest similarity is that both techniques come from observing nature. In both cases, humans observed how neural networks and genetics work, and created a simplified mathematical model that imitates their behavior. Nature has millions of years of evolution, why not try to imitate it? 🌱

I tried to be brief about GA, but if you have any specific question on this vast topic, please leave it in the comments 🙋 🙋‍♂

And, if I didn't motivate you enough to study GA, check this project which is based on Neuroevolution:

Thanks for reading 🚀

Find me on Twitter and Linkedin.
More blog posts.

Want to learn more? 📗 Data Science Live Book

Integrating R and Telegram

Pablo Casas — Wed, 07 Nov 2018 14:08:12 GMT

Hi there!

tl;dr: Some models (deep learning), and even some data preparation scripts, take a long time to finish. We can get notified when the process ends by sending Telegram messages from R.

Get notified by Telegram bot

This section is entirely based on the documentation of the telegram.bot package, by Ernest Benedito. Please visit the site to make use of the full capabilities of this package.

The idea of getting notified by Telegram is that we can see the notification either on our cellphone or in the web version.

Step 1: Create a bot

Find @BotFather on Telegram. Send the message: /start. Then /newbot. And follow the instructions.

Save the bot token and never share it publicly.

Step 2: Set up the bot

After your bot is created, you have to send it the message /start. And the bot is finally configured!

Step 3: Use it with R

Put the bot token in the .Renviron:

user_renviron <- path.expand(file.path("~", ".Renviron"))
file.edit(user_renviron)

This should look something like this:

Now restart R.
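After restarting, a quick base R check can confirm that the token is visible to the session before loading the package. This is a small sketch: token_available is a hypothetical helper name, and it relies on the R_TELEGRAM_BOT_{bot name} naming convention from the telegram.bot documentation.

```r
# TRUE when the environment variable for this bot is defined and non-empty
token_available <- function(bot_name) {
  env_var <- paste0("R_TELEGRAM_BOT_", bot_name)
  nzchar(Sys.getenv(env_var))
}

# 'arbot_bot' is the bot name used in this post; replace it with yours
if (!token_available("arbot_bot")) {
  message("Token not found: check the variable name in .Renviron and restart R")
}
```

The check never prints the token itself, which is handy if you share your console output.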

# install.packages("telegram.bot")
library(telegram.bot)

# Initiate the bot session using the token from the environment variable.
bot = Bot(token = bot_token('arbot_bot'))

# The first time, you will need the chat id (which is the chat where you will get the notifications)
updates = bot$getUpdates()

> updates
  update_id message.message_id message.from.id message.from.is_bot message.from.first_name message.from.last_name
1 639401623                  1       174860321               FALSE            admin                 admin
2 639401624                  2       174860321               FALSE            admin                 admin
  message.from.language_code message.chat.id message.chat.first_name message.chat.last_name message.chat.type message.date
1                      en-US       174860321            admin                 admin           private   1540571205
2                      en-US       174860321            admin                 admin           private   1540571208
  message.text  message.entities
1       /start 0, 6, bot_command
2        hello              NULL

Time to use it in the R workflow! We will send a test message and a plot:

Note 1: chat_id is the message.chat.id value shown in the updates above.
Note 2: the environment variable in .Renviron must be named R_TELEGRAM_BOT_{the name of your bot}.

# Sending text
message_to_bot=sprintf('Process finished - Accuracy: %s', 0.99)

bot$sendMessage(chat_id = 174860321, text = message_to_bot)

# Sending image (we need to save it first)
library(ggplot2)
my_plot=ggplot(mtcars, aes(x=mpg))  + geom_histogram(bins = 5)
ggplot2::ggsave("my_plot.png", my_plot)

bot$sendPhoto(chat_id = 174860321, photo = 'my_plot.png')

The results on Telegram Web:

Note: I also tested the telegram package and it works. However, telegram.bot seems more complete due to the bot options.

Check the full list of options to interact with the bot 🤖.
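To plug the notification into a long-running script, one option is a small wrapper that runs a task and reports whether it finished or failed. This is only a sketch: run_and_notify is a hypothetical helper, and the notify argument is any function that takes a message, for example function(msg) bot$sendMessage(chat_id = 174860321, text = msg). Here a stand-in notifier collects the messages so the example runs without a bot.

```r
# Run a task and send a message when it ends, whether it succeeds or fails
run_and_notify <- function(task, notify) {
  started <- Sys.time()
  result <- tryCatch(task(), error = function(e) e)
  elapsed <- round(as.numeric(difftime(Sys.time(), started, units = "secs")), 1)

  if (inherits(result, "error")) {
    notify(sprintf("Process failed after %s secs: %s", elapsed, conditionMessage(result)))
  } else {
    notify(sprintf("Process finished in %s secs", elapsed))
  }
  invisible(result)
}

# Stand-in notifier: stores the messages instead of sending them
sent <- character(0)
collect <- function(msg) sent <<- c(sent, msg)

ok_result  <- run_and_notify(function() sum(1:10), collect)
err_result <- run_and_notify(function() stop("out of memory"), collect)

sent
```

Catching the error matters: without it, a crash at 3 AM would stop the script before any message is sent.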

Get notified by sound

Another way of getting notified is by producing a sound: 🔔 beep!

# install.packages("beepr")
library(beepr)

## do some stuff, and...

beep()
beep()

Thanks for reading 🚀

Blog | Linkedin | Twitter | 📗 Data Science Live Book

How to apply a function to a matrix/tibble

Pablo Casas — Tue, 25 Sep 2018 18:27:33 GMT

Scenario: we have a table of id-value, and a matrix/tibble that contains the ids, and we need the labels.

It may be useful when predicting the keys (or ids) in a classification model (like in Keras), and we need the labels as the final output.

There are two interesting things:

The usage of apply based on column and rows at the same time.
The creation of an empty tibble and how to fill it (append columns)

library(tidyverse)
# mapping table (id-value)
map_table=tibble(id=c(1,2,3), 
                 value=c("a", "b", "c")
                 )

map_table

## # A tibble: 3 x 2
##      id value
##   <dbl> <chr>
## 1     1 a    
## 2     2 b    
## 3     3 c

# given a key, return the label
get_label <- function(x) 
{
  res=filter(map_table, id==x)$value
  return(res)
}

# the data to get the label
X_data=tibble(v1=c(1,2,3), 
              v2=c(2,2,2),
              v3=c(3,2,1)
              )

X_data

## # A tibble: 3 x 3
##      v1    v2    v3
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     2     2     2
## 3     3     2     1

Option 1: as matrix

mat_res=apply(X_data, 1:2, get_label)

## Checking...
mat_res

##      v1  v2  v3 
## [1,] "a" "b" "c"
## [2,] "b" "b" "b"
## [3,] "c" "b" "a"
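The key is the second argument of apply (MARGIN): 1 means by row, 2 means by column, and 1:2 means the function is called once per cell. A tiny sketch with a numeric matrix (made up for the example) shows the difference:

```r
m <- matrix(1:6, nrow = 2)

# MARGIN = 1:2: the function receives one cell at a time, the shape is kept
by_cell <- apply(m, 1:2, function(x) x * 10)

# MARGIN = 1: the function receives a whole row
by_row <- apply(m, 1, sum)

# MARGIN = 2: the function receives a whole column
by_col <- apply(m, 2, sum)
```

That is why get_label, which expects a single id, works with 1:2 without any change.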

Option 2: as tibble (using 'for')

# creating a 1-column tibble filled with NAs, same length as nrow(X_data)
tib_res=tibble(V1=rep(NA, nrow(X_data))) 
for(i in 1:ncol(X_data))
{
  vec=X_data[,i]
  vec_lbl=sapply(t(vec), get_label) # if X_data is a matrix, no need to transpose with t()
  tib_res[,i]=vec_lbl
}

## Checking...
tib_res

## # A tibble: 3 x 3
##   V1    V2    V3   
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 b     b     b    
## 3 c     b     a

Option 3: as tibble (using 'mutate_all')

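# get_label is not vectorized: called on a whole column it compares the ids
# position by position, so we apply it element by element within each column
tib_res_2=mutate_all(X_data, .funs = ~ sapply(.x, get_label))
tib_res_2

## # A tibble: 3 x 3
##   v1    v2    v3   
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 b     b     b    
## 3 c     b     a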

Finally...

Option 2, to my surprise, is faster than option 1.
I didn't use add_column because of the need to replace the first dummy NA column.
Other approaches may include dictionaries.
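As a sketch of the dictionary idea, a named vector can play that role in base R: the ids become the names and the labels the values. The example recreates the same data with plain data frames so it runs without the tidyverse; the object names (dict, mat_ids, mat_lookup) are just for illustration.

```r
# Same mapping table and data as above, as plain data frames
map_table <- data.frame(id = c(1, 2, 3), value = c("a", "b", "c"),
                        stringsAsFactors = FALSE)
X_data <- data.frame(v1 = c(1, 2, 3), v2 = c(2, 2, 2), v3 = c(3, 2, 1))

# The "dictionary": labels named by their id
dict <- setNames(map_table$value, map_table$id)

# Look up every cell at once, then restore the matrix shape and column names
mat_ids <- as.matrix(X_data)
mat_lookup <- matrix(unname(dict[as.character(mat_ids)]),
                     nrow = nrow(mat_ids),
                     dimnames = dimnames(mat_ids))

mat_lookup
```

The lookup is vectorized, so there is no per-cell function call as in the options above.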

Any improvement in the code is welcome.

Thanks for reading 🚀

Blog | Linkedin | Twitter | 📗 Data Science Live Book
