<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Data Science Heroes Blog]]></title><description><![CDATA[Data analysis with R]]></description><link>https://blog.datascienceheroes.com/</link><image><url>https://blog.datascienceheroes.com/favicon.png</url><title>Data Science Heroes Blog</title><link>https://blog.datascienceheroes.com/</link></image><generator>Ghost 1.24</generator><lastBuildDate>Tue, 21 Apr 2026 04:38:03 GMT</lastBuildDate><atom:link href="https://blog.datascienceheroes.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[funModeling: New site, logo and version 🚀]]></title><description><![CDATA[funModeling is focused on exploratory data analysis, data preparation and the evaluation of models. 
Check the latest functions and website here :)]]></description><link>https://blog.datascienceheroes.com/funmodeling-new-site-and-version/</link><guid isPermaLink="false">5ee7854942cada04ad16216a</guid><category><![CDATA[data-science-live-book]]></category><category><![CDATA[funModeling]]></category><category><![CDATA[data science]]></category><category><![CDATA[ML]]></category><category><![CDATA[rstats]]></category><category><![CDATA[R]]></category><dc:creator><![CDATA[Pablo Casas]]></dc:creator><pubDate>Mon, 15 Jun 2020 15:16:51 GMT</pubDate><media:content url="https://blog.datascienceheroes.com/content/images/2020/06/Screen-Shot-2020-06-15-at-12.15.11.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.datascienceheroes.com/content/images/2020/06/Screen-Shot-2020-06-15-at-12.15.11.png" alt="funModeling: New site, logo and version 🚀"><p>Hi there!</p>
<p>{tl;dr} Website, <a href="http://pablo14.github.io/funModeling/index.html">here</a> ✅</p>
<p>In case you don't know, <code>funModeling</code> is the package I've been developing over the last few years.</p>
<p>It's focused on exploratory data analysis, data preparation and the evaluation of models.</p>
<h2 id="news">News</h2>
<p>Yesterday I published the latest version, which fixes one of the plots in <code>cross_plot</code>. But that's not as fun as the announcement of its new logo!</p>
<img src="https://s3.amazonaws.com/datascienceheroes.com/img/blog/funModeling_logo_hq.png" alt="funModeling: New site, logo and version 🚀" width="400px">
<p>Also... I added <code>coord_plot</code>, useful when we are profiling any clustering model:</p>
<p><code>coord_plot(data=mtcars2, group_var=&quot;cluster&quot;, group_func=median, print_table=TRUE)</code></p>
<img src="https://blog.datascienceheroes.com/content/images/2020/06/coord_plot.png" alt="funModeling: New site, logo and version 🚀" width="600px">
<p>You can choose the summarization function (mean by default). Yeah... no more outlier biases in the mean, long live the percentiles!</p>
<p>Oh... and <code>coord_plot</code> produces, at the same time, a table with the results:</p>
<img src="https://blog.datascienceheroes.com/content/images/2020/06/coord_plot2.png" alt="funModeling: New site, logo and version 🚀" width="600px">
<img src="https://media.giphy.com/media/U3DMvIlfDoHnYX9atH/giphy.gif" alt="funModeling: New site, logo and version 🚀" width="500px">
<p>And it shows the underlying <code>funModeling</code> philosophy: little code, graphics and a table with results (easier to operate 🦾).</p>
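<p>If you want to try the <code>coord_plot</code> call above, here is a minimal sketch. It assumes that <code>mtcars2</code> is just a few <code>mtcars</code> columns plus a <code>cluster</code> column; the k-means step, the selected variables and the number of clusters are assumptions made for illustration, not necessarily how the data in the screenshots was built:</p>
<pre><code class="language-r">library(funModeling)

# Assumption: build a small clustered dataset from mtcars (k-means, 3 groups)
set.seed(42)
vars = mtcars[, c(&quot;mpg&quot;, &quot;hp&quot;, &quot;wt&quot;, &quot;qsec&quot;)]
km = kmeans(scale(vars), centers = 3)

mtcars2 = vars
mtcars2$cluster = km$cluster

# Profile each cluster using the median instead of the default mean
coord_plot(data = mtcars2, group_var = &quot;cluster&quot;, group_func = median, print_table = TRUE)
</code></pre>
<p>Any function that summarizes a numeric vector into one value could be passed as <code>group_func</code>.</p>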
<h2 id="blogpostsbasedonfunmodeling">Blog posts based on <code>funModeling</code>:</h2>
<ul>
<li><a href="https://blog.datascienceheroes.com/exploratory-data-analysis-in-r-intro/">Exploratory Data Analysis in R (introduction)</a></li>
<li><a href="https://blog.datascienceheroes.com/automatic-data-types-checking-in-predictive-models/">Automatic data types checking in predictive models</a></li>
<li><a href="https://blog.datascienceheroes.com/fast-data-exploration-for-predictive-modeling/">Fast data exploration for predictive modeling</a></li>
<li><a href="https://blog.datascienceheroes.com/discretization-recursive-gain-ratio-maximization/">New discretization method: Recursive information gain ratio maximization</a></li>
</ul>
<h2 id="officialpage">Official page</h2>
<ul>
<li><code>funModeling</code> <a href="http://pablo14.github.io/funModeling/">official webpage</a></li>
<li>Check the vignette <a href="http://pablo14.github.io/funModeling/articles/funModeling_quickstart.html">here</a>.</li>
</ul>
<h2 id="learndatascience">Learn Data Science</h2>
<img src="https://blog.datascienceheroes.com/content/images/2020/06/dslb_wood.jpg" alt="funModeling: New site, logo and version 🚀" width="300px">
<p>You can learn and apply more functions using the <a href="https://livebook.datascienceheroes.com/">Data Science Live Book</a>. You can also buy a digital copy (name your price) <a href="https://livebook.datascienceheroes.com/download-book.html">here</a>.</p>
<p>Speak Spanish? Want to study #ML? 👉 <a href="https://escueladedatosvivos.ai">https://escueladedatosvivos.ai</a></p>
<p>Do you use <code>funModeling</code> for teaching? Please contact me, I'd like to know more :)</p>
<hr>
<p>That's all for now!</p>
<img src="https://media.giphy.com/media/33E7ZjlQEMgF6kbkhY/giphy.gif" alt="funModeling: New site, logo and version 🚀" width="400px">
<p><a href="https://twitter.com/pabloc_ds">Twitter</a> | <a href="https://www.linkedin.com/in/pcasas/">LinkedIn</a></p>
</div>]]></content:encoded></item><item><title><![CDATA[Tips before migrating to a newer R version]]></title><description><![CDATA[A summary of common problems that my colleagues and I had when migrating R / packages to newer version.]]></description><link>https://blog.datascienceheroes.com/tips-before-migrating-to-a-newer-r-version/</link><guid isPermaLink="false">5ea615b542cada04ad162158</guid><category><![CDATA[data science]]></category><category><![CDATA[machine learning]]></category><category><![CDATA[R]]></category><dc:creator><![CDATA[Pablo Casas]]></dc:creator><pubDate>Tue, 28 Apr 2020 13:36:51 GMT</pubDate><media:content url="https://blog.datascienceheroes.com/content/images/2020/04/moss_R_fire-1.gif" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.datascienceheroes.com/content/images/2020/04/moss_R_fire-1.gif" alt="Tips before migrating to a newer R version"><p>This post is based on real events.</p>
<p>Several times, after installing the latest version of R and reinstalling all the packages I had in the previous version, I ran into problems. The same applies when updating packages after a long time.</p>
<p>I decided to make this post after seeing the community reception to a quick post I made:</p>
<p><img src="https://blog.datascienceheroes.com/content/images/2020/04/Screen-Shot-2020-04-28-at-10.09.41.png" alt="Tips before migrating to a newer R version"></p>
<p>This post -also available in Spanish <a href="https://escueladedatosvivos.ai/blog/207351/consejos-spara-migrar-r-y-sobrevivir-en-el-tiempo">here</a>- is not meant to discourage installing a new version of R. On the contrary, it warns about the &quot;dark side&quot; of the migration so that our projects stay stable over time.</p>
<p>Luckily, functions usually change for the better, or even much better, as is the case with the tidyverse suite.</p>
<hr>
<p>🗞 (A little announcement for those who speak Spanish 🇪🇸) Three weeks ago I created the data school <a href="https://EscuelaDeDatosVivos.AI">EscuelaDeDatosVivos.AI</a>, where you can find an introductory <strong>free</strong> R course for data science (which includes the <code>tidyverse</code> and <code>funModeling</code> among others) 👉 <a href="https://escueladedatosvivos.ai/p/curso-desembarcando-en-r-2da-edicion-gratis">Desembarcando en R</a></p>
<hr>
<h3 id="projectsthatarenotfrequentlyexecuted">Projects that are not frequently executed</h3>
<p>For example, after a migration, when running the build that generates the <a href="https://livebook.datascienceheroes.com/">Data Science Live Book</a> (written 100% in R Markdown), I have seen function deprecation messages shown as warnings. Naturally, I have to remove the deprecated calls or switch to the new function.</p>
<p>I also ran into cases where some R Markdown parameters had changed.</p>
<h4 id="anothercase">Another case 🎬</h4>
<p>Imagine the following flow: <strong>R 4.0.0</strong> is installed, then the latest version of all packages. Taking ggplot as an example, we go from 2.8.1 to 3.5.1.</p>
<p>Version 3.5.1 no longer has a function our script relies on, because it was deprecated and later removed, ergo the script fails. Or a function has changed (examples from the tidyverse: <code>mutate_at</code>, <code>mutate_if</code>). What changes is the so-called signature of the function, e.g. the <code>.vars</code> parameter.</p>
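<p>As an illustration of this kind of change: <code>dplyr</code> 1.0.0 introduced <code>across()</code>, which supersedes the scoped verbs such as <code>mutate_at</code> and <code>mutate_if</code>. Here is a small sketch of the same transformation in both styles (the columns are arbitrary, chosen only for the example):</p>
<pre><code class="language-r">library(dplyr)

# Old style: scoped verb using the .vars parameter
old_way = mtcars %&gt;% mutate_at(.vars = vars(mpg, hp), .funs = round)

# Newer style (dplyr 1.0.0 or later): across() inside mutate()
new_way = mtcars %&gt;% mutate(across(c(mpg, hp), round))

# Compare both results
all.equal(old_way, new_way)
</code></pre>
<p>Scripts written in the old style may keep running for a while, so it is worth reading the messages a package prints after an update.</p>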
<h3 id="packageinstallation">Package installation</h3>
<p><img src="https://blog.datascienceheroes.com/content/images/2020/04/pasted-image-0.png" alt="Tips before migrating to a newer R version"></p>
<p>Well, if we migrate and don't install everything we had before, we're going to run an old script and have this problem.</p>
<p>Some recommend listing all the packages we have installed, and generating a script to install them.</p>
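<p>A minimal sketch of that idea using only base R. The file name is hypothetical, and packages that were installed from GitHub or Bioconductor would not be found by a plain <code>install.packages</code> call, so treat this as a starting point:</p>
<pre><code class="language-r"># --- In the OLD R installation: save the names of the installed packages
old_pkgs = rownames(installed.packages())
saveRDS(old_pkgs, &quot;installed_packages.rds&quot;)  # hypothetical file name

# --- In the NEW R installation: install whatever is missing
old_pkgs = readRDS(&quot;installed_packages.rds&quot;)
missing_pkgs = setdiff(old_pkgs, rownames(installed.packages()))
if (length(missing_pkgs) &gt; 0) install.packages(missing_pkgs)
</code></pre>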
<p>Another solution is to manually copy the packages from a folder of the old version of R to the new one. The packages are folders within the installation of R.</p>
<h3 id="ronservers">R on servers</h3>
<p>Another case: a team has R installed on a server with processes running every day, they do the migration, and some of the functions change their signature. That is, the parameters a function accepts, or the type of data it expects, change.</p>
<p>This should not happen often if one updates package versions frequently. The normal flow for removing a function from an R package is to first announce it with a warning, the famous <code>deprecated</code>: <a href="https://stackoverflow.com/questions/44622054/mark-a-function-as-deprecated-in-customised-r-package">Mark a function as deprecated in customised R package.</a></p>
<p>If the announcement is made in version N+1, and we switch directly from N to N+2, we may miss the message and find that the function is no longer available.</p>
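<p>The deprecation flow described above can be sketched with base R's <code>.Deprecated()</code>. Both functions below are hypothetical and exist only to illustrate the idea:</p>
<pre><code class="language-r"># Hypothetical functions, used only to illustrate the flow
new_mean = function(x) mean(x, na.rm = TRUE)

old_mean = function(x) {
  # Version N+1: the old function still works, but warns the user
  .Deprecated(&quot;new_mean&quot;)
  new_mean(x)
}

old_mean(c(1, 2, NA))  # returns 1.5 and raises a deprecation warning

# Version N+2: old_mean() is removed, so any script still calling it fails
</code></pre>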
<h3 id="soitisnotadvisabletoupgradepackagesandr">So it is not advisable to upgrade packages and R?</h3>
<p>As I said at the beginning, of course I encourage the migration.</p>
<p>We must be alert and test the projects we already have running.</p>
<p>Otherwise, we wouldn't have many of the conveniences that today's languages give us thanks to their communities. This is not specific to R.</p>
<hr>
<p>📝 Now that <code>tidymodels</code> is out, here's another post that might interest you: <a href="https://blog.datascienceheroes.com/how-to-use-recipes-package-for-one-hot-encoding/">How to use <code>recipes</code> package from <code>tidymodels</code> for one hot encoding 🛠</a></p>
<hr>
<h3 id="someadviceenvironments">Some advice: Environments</h3>
<p><img src="https://blog.datascienceheroes.com/content/images/2020/04/moss_exintor.gif" alt="Tips before migrating to a newer R version"></p>
<p>Python has a very useful concept, the virtual environment. It is created quickly, and its effect is that every library is installed inside the project folder.</p>
<p>Then you run <code>pip freeze &gt; requirements.txt</code> and all the libraries, with their versions, are written to a text file that anyone can use to quickly recreate the environment in which the project was developed. <a href="https://medium.com/@boscacci/why-and-how-to-make-a-requirements-txt-f329c685181e">Why and How to make a Requirements.txt</a></p>
<p>This is not so easy in R. There is <a href="https://rstudio.github.io/packrat/">packrat</a>, but it has its complexities, for example when some packages come from GitHub repos.</p>
<p><a href="https://www.hasselpunk.com/">Augusto Hassel</a> just told me about the <a href="https://rstudio.github.io/renv/articles/renv.html">renv</a> library (also from RStudio! 👏). I quote the page:</p>
<blockquote>
<p>&quot;The renv package is a new effort to bring project-local R dependency management to your projects. The goal is for renv to be a robust, stable replacement for the Packrat package, with fewer surprises and better default behaviors.&quot;</p>
</blockquote>
<p>You can see the slides from <code>renv</code>: <a href="https://resources.rstudio.com/rstudio-conf-2020/renv-project-environments-for-r-kevin-ushey">Project Environments for R</a>, by Kevin Ushey.</p>
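<p>As a rough sketch, the basic <code>renv</code> workflow looks like this. These calls create and modify files in the current project, so run them inside a project folder you are happy to change:</p>
<pre><code class="language-r"># install.packages(&quot;renv&quot;)

# Run once inside the project: creates a project-local library and renv.lock
renv::init()

# After installing or updating packages: record the exact versions in renv.lock
renv::snapshot()

# On another machine (or after a migration): reinstall the recorded versions
renv::restore()
</code></pre>
<p>The <code>renv.lock</code> file plays a role similar to <code>requirements.txt</code> in Python.</p>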
<h3 id="docker">Docker</h3>
<p><img src="https://blog.datascienceheroes.com/content/images/2020/04/docker.png" alt="Tips before migrating to a newer R version"></p>
<p>Augusto also told me about Docker as a solution:</p>
<blockquote>
<p>&quot;Using Docker we can encapsulate the environment needed to run our code through an instruction file called Dockerfile. This way, we'll always be running the same image, wherever we pick up the environment.&quot;</p>
</blockquote>
<p>Here's a post by him (in Spanish): <a href="https://www.hasselpunk.com/blog/miprimerrepositorioendocker/">My First Docker Repository</a></p>
<h3 id="conclusions">Conclusions</h3>
<p>✅ If you have R in production, have a testing environment and a production environment.</p>
<p>✅ Install R, your libraries, and then check that everything is running as usual.</p>
<p>✅ Have <a href="https://en.wikipedia.org/wiki/Unit_testing">unit tests</a> to automatically check that the data flow is not broken. In R check: <a href="https://testthat.r-lib.org/">testthat</a>.</p>
<p>✅ Update all libraries every X months, don't let too much time go by.</p>
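<p>To make the unit test advice concrete, here is a tiny <code>testthat</code> sketch. The function <code>clean_ages</code> is hypothetical and stands for any step of your data flow:</p>
<pre><code class="language-r">library(testthat)

# Hypothetical step of a data flow, used only for illustration
clean_ages = function(x) {
  x[x &lt; 0] = NA  # negative ages are considered invalid
  x
}

test_that(&quot;negative ages are converted to NA&quot;, {
  res = clean_ages(c(25, -1, 40))
  expect_equal(res, c(25, NA, 40))
  expect_length(res, 3)
})
</code></pre>
<p>Running tests like this one right after a migration is a quick way to detect that a function changed its behavior.</p>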
<p>The moral: being a data scientist also means solving version, installation and environment problems.</p>
<hr>
<p>Moss! What did you think of the post?</p>
<p><img src="https://blog.datascienceheroes.com/content/images/2020/04/happy_moss.gif" alt="Tips before migrating to a newer R version"></p>
<p>Happy update!</p>
<p>📬 Find me at: <a href="https://www.linkedin.com/in/pcasas/">Linkedin</a> &amp; <a href="https://twitter.com/pabloc_ds">Twitter</a>.</p>
</div>]]></content:encoded></item><item><title><![CDATA[SPAM detection using fastai ULMFiT - Part 1: Language Model]]></title><description><![CDATA[Tutorial to fastai ULMFiT model for classification texts
(and some of the theory behind it) 🤖📚]]></description><link>https://blog.datascienceheroes.com/spam-detection-using-fastai-ulmfit-part-1-language-model/</link><guid isPermaLink="false">5df6cfe142cada04ad1620e2</guid><category><![CDATA[fastai]]></category><category><![CDATA[Python]]></category><category><![CDATA[deep-learning]]></category><category><![CDATA[AI]]></category><category><![CDATA[NLP]]></category><category><![CDATA[ULMFiT]]></category><dc:creator><![CDATA[Pablo Casas]]></dc:creator><pubDate>Mon, 23 Dec 2019 15:39:20 GMT</pubDate><media:content url="https://blog.datascienceheroes.com/content/images/2019/12/iRobot_SiP-1.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.datascienceheroes.com/content/images/2019/12/iRobot_SiP-1.jpg" alt="SPAM detection using fastai ULMFiT - Part 1: Language Model"><p>tl;dr: 👉 show me the code!  🔥 <strong><a href="https://colab.research.google.com/drive/1fuJg9TyfsgLCzlWQ4_Etu3LJzC9Xrl8B">here</a></strong> 🔥</p>
<img src="https://media.giphy.com/media/DwrnYsZCXspu8/giphy.gif" width="400px" alt="SPAM detection using fastai ULMFiT - Part 1: Language Model">
<p><strong>UPDATE Feb.21.2020</strong> Part 2, the classification model, is <a href="https://colab.research.google.com/drive/18wX1iUwzw-GQd6kI9HcC9hr4g6gZBRRT">here</a></p>
<h2 id="nontechnicalintroduction">Non-technical introduction</h2>
<p>Imagine you are a lawyer who wants to study medicine. Although it is a huge change, the underlying idea is that you already know how to speak English, how to build a meaningful text, and the rules of the language.</p>
<p>So when you jump into medicine, you don't have to learn from scratch that after the word &quot;They&quot; comes the word &quot;were&quot; (not &quot;was&quot;).</p>
<p>You only learn the particularities of the domain field (medicine).<br>
But what is <strong>ULMFiT</strong>? 📚🤖</p>
<p>ULMFiT stands for Universal Language Model Fine-tuning, and it is implemented in the <code>fastai</code> Python library.</p>
<h3 id="whyisituseful">Why is it useful? 🤔</h3>
<p>It saves us time when creating an NLP project: thanks to the <strong>transfer learning</strong> technique, we only need to <strong>fine-tune</strong> the network to our data. Let's say it learns the words of the domain field.</p>
<p>Especially handy if we don't have lots of data.</p>
<img src="https://media.giphy.com/media/13TfEn74wWwZAQ/giphy.gif" alt="SPAM detection using fastai ULMFiT - Part 1: Language Model">
<h3 id="aboutgooglecolab">About google colab</h3>
<p>Not new, but Google Colab is a tool that allows us to run Python notebook projects using the GPUs from Google servers. It's free!</p>
<p>This two-post series can be run in your browser, just by executing all the cells! Time to play :)</p>
<p>Besides running the uploaded version, you can copy the project directly to your google drive and do all the practice you want! (<em>File -&gt; Save a copy in drive</em>)</p>
<p>Read more: <a href="https://medium.com/deep-learning-turkey/google-colab-free-gpu-tutorial-e113627b9f5d">Google Colab Free GPU Tutorial</a></p>
<h2 id="goingmoretechnical">Going more technical</h2>
<p>The project is split into:</p>
<p>1- Create the language model<br>
2- Create the classification model</p>
<p>The language model is what handles the word and semantics representations, and it can be chained to the classification model quickly.</p>
<p>I suggest you read <a href="https://arxiv.org/abs/1801.06146">Universal Language Model Fine-tuning for Text Classification</a>, the paper by <a href="https://twitter.com/jeremyphoward">Jeremy Howard</a> and Sebastian Ruder that introduced the method.</p>
<p>ULMFiT starts from a network pre-trained on WikiText-103, a corpus of roughly 103 million words taken from Wikipedia articles. So it already knows how to speak &quot;neutral&quot; English.</p>
<p><img src="https://blog.datascienceheroes.com/content/images/2019/12/ulmfit.png" alt="SPAM detection using fastai ULMFiT - Part 1: Language Model"></p>
<p>Source: arXiv:1801.06146v5</p>
<ul>
<li><strong>Part 1</strong> (this post) covers sections (a) and (b): download the pre-trained language model and fine-tune it with our data.</li>
<li><strong>Part 2:</strong> (c) create the classification model.</li>
</ul>
<p>📚 Learn more from:</p>
<ul>
<li>Official web page: <a href="http://nlp.fast.ai/">http://nlp.fast.ai/</a></li>
<li><strong>fastai</strong> youtube lesson: <a href="https://youtu.be/vnOpEwmtFJ8?t=4511">https://youtu.be/vnOpEwmtFJ8?t=4511</a> (it starts at ULMFiT stage)</li>
</ul>
<h2 id="code">Code 💻</h2>
<p>This blog post assumes you have some prior knowledge of deep learning. But if not, I encourage you to run the whole project and play with it by making little changes in the code to see what happens!</p>
<p>Some of the topics covered in the Google Colab are:</p>
<ul>
<li>Pretrained model advantages (transfer learning)</li>
<li>ULMFiT in other languages? (other than English)</li>
<li>What is an embedding?</li>
<li>How to train a language model</li>
</ul>
<p>--</p>
<p>📌 Run the project here 👉 <strong><a href="https://colab.research.google.com/drive/1fuJg9TyfsgLCzlWQ4_Etu3LJzC9Xrl8B">google colab</a></strong></p>
<p><strong>UPDATE Feb.21.2020</strong> Don't forget to check Part 2, the classification model, <a href="https://colab.research.google.com/drive/18wX1iUwzw-GQd6kI9HcC9hr4g6gZBRRT">here</a></p>
<hr>
<p>Have data fun! 🚀</p>
<p>📬 Find me at: <a href="https://www.linkedin.com/in/pcasas/">Linkedin</a> &amp; <a href="https://twitter.com/pabloc_ds">Twitter</a>.<br>
<a href="https://livebook.datascienceheroes.com/">Data Science Live Book</a> 📗</p>
</div>]]></content:encoded></item><item><title><![CDATA[How Auth0’s Data Team uses R and Python]]></title><description><![CDATA[Auth0 Data Team shares their tooling, from R to Python, their favourite open-souce libraries for data science and data engineering 🛠]]></description><link>https://blog.datascienceheroes.com/how-auth0-data-team-uses-r-and-python/</link><guid isPermaLink="false">5de6630f42cada04ad1620ce</guid><category><![CDATA[Python]]></category><category><![CDATA[data engineering]]></category><category><![CDATA[data science]]></category><category><![CDATA[tidyverse]]></category><category><![CDATA[AWS]]></category><category><![CDATA[Airflow]]></category><category><![CDATA[fastai]]></category><category><![CDATA[R]]></category><dc:creator><![CDATA[Pablo Casas]]></dc:creator><pubDate>Tue, 03 Dec 2019 16:24:24 GMT</pubDate><media:content url="https://blog.datascienceheroes.com/content/images/2019/12/r-language-python.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.datascienceheroes.com/content/images/2019/12/r-language-python.png" alt="How Auth0’s Data Team uses R and Python"><p>The Data team is responsible for crunching, reporting, and serving data. The team also does data integrations with other systems, creating machine learning, and deep learning models.</p>
<p>With this post, we intend to share our favorite tools, which have proven to work with thousands of millions of records.<br>
Scaling processes in real-world scenarios is a hot topic among newcomers to data.</p>
<p><em>This post first appeared at: <a href="https://auth0.com/blog/how-the-auth0-data-team-uses-r-and-python/">https://auth0.com/blog/how-the-auth0-data-team-uses-r-and-python/</a></em></p>
<h2 id="rorpython">R or Python?</h2>
<p>Well... both!</p>
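<p>R is a GNU project, conceived as a language for statistical computing. It is similar to the S language, which was developed at Bell Laboratories; R itself was created by Ross Ihaka and Robert Gentleman at the University of Auckland in the 1990s.</p>
<p>Python, first released in 1991 by Guido van Rossum, is a general-purpose language with a focus on code readability.</p>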
<p>Both R and Python are highly extensible through packages.</p>
<p>We mainly use R for our data processes and ML projects, and Python to do the integrations and Deep Learning projects.</p>
<p>Our stack is R with RStudio, and Python 3 with Jupyter notebooks.</p>
<p><img src="https://blog.datascienceheroes.com/content/images/2019/12/r_py.png" alt="How Auth0’s Data Team uses R and Python"></p>
<p><a href="https://rstudio.com">RStudio</a> is an open-source, feature-rich IDE that lets us browse the data and objects created during the session, view plots, and debug code, among many other options. It also provides an enterprise-ready solution.</p>
<p><a href="https://jupyter.org/">Jupyter</a> is also an open-source tool, designed to interface with Julia, Python, and R. Today it is widely used by data scientists to share their analyses. More recently, Google created &quot;Colab&quot;, a Jupyter notebook environment that runs in the cloud and integrates with Google Drive.</p>
<h2 id="soisrcapableofrunningonproduction">So is R capable of running on production?</h2>
<p>Yes.</p>
<p>We run several heavy data preparations and predictive models every day, every hour, and every few minutes.</p>
<h2 id="howdowerunrandpythontasksonproduction">How do we run R and Python tasks on production?</h2>
<p>We use <a href="https://airflow.apache.org/">Airflow</a> as an orchestrator, an open-source project created by Airbnb.</p>
<p>Airflow is an incredible and robust project which allows us to schedule processes, assign priorities, define rules, keep detailed logs, etc.</p>
<p>For development, we still use the form: <code>Rscript my_awesome_script.R</code>.</p>
<p>Airflow is a Python-based task scheduler that allows us to run chained processes with many complex dependencies, monitoring the current state of all of them and firing alerts to Slack if anything goes wrong. This is ideal for running import jobs that populate the Data Warehouse with fresh data every day.</p>
<h2 id="dowehaveadatawarehouse">Do we have a data warehouse?</h2>
<p>Yes, and it's huge!</p>
<p>It's mounted on Amazon Redshift, a suitable option if scaling is a priority. Visit their <a href="https://docs.aws.amazon.com/redshift/latest/mgmt/welcome.html">website</a> to learn more about it.</p>
<p><img src="https://blog.datascienceheroes.com/content/images/2019/12/redshift.png" alt="How Auth0’s Data Team uses R and Python"></p>
<p>R connects directly to Amazon Redshift thanks to the <a href="https://github.com/auth0/rauth0">rauth0 package</a>, which uses the <a href="https://blog.datascienceheroes.com/redshifttools-v1-0-0-cran-release">redshiftTools</a> package, developed by <a href="https://auth0.com/blog/authors/pablo-seibelt/">Pablo Seibelt</a>.</p>
<p>Generally, data is uploaded from R to Amazon Redshift using <code>redshiftTools</code>.<br>
This data can come either from plain files or from data frames created during the R session.</p>
<p>We use Python to import and export unstructured data since, in our experience, R currently lacks convenient libraries to handle it.</p>
<p>We have experimented with JSON libraries in R but the result is much worse than using Python in this scenario. For example, using <a href="https://rdrr.io/cran/RJSONIO/man/fromJSON.html">RJSONIO</a> the dataset is automatically transformed into an R Data Frame, with little control of how the transformation is done. This is only useful for very simple JSON data structures and is very difficult to manipulate in R, compared to Python where this is much easier and more natural.</p>
<h2 id="howdowedealwithdatapreparationusingr">How do we deal with data preparation using R?</h2>
<p>We have two scenarios, data preparation for data engineering, and data preparation for machine learning/AI.</p>
<p>One of the biggest strengths of R is the <a href="https://www.tidyverse.org/packages">tidyverse</a>, a set of packages developed by lots of ninja developers, some of them working at RStudio Inc. These packages provide a common API and a shared philosophy for working with data. We will cover an example in the next section.</p>
<p><img src="https://blog.datascienceheroes.com/content/images/2019/12/tidyverse.png" alt="How Auth0’s Data Team uses R and Python"></p>
<p>The tidyverse, especially the <a href="https://dplyr.tidyverse.org">dplyr</a> package, contains a set of functions that make the exploratory data analysis and data preparation quite comfortable.</p>
<p>For certain data crunching, data preparation and visualization tasks, we use the <a href="https://blog.datascienceheroes.com/exploratory-data-analysis-data-preparation-with-funmodeling/">funModeling</a> package. It was the seed for an open-source book I published some time ago: <a href="http://livebook.datascienceheroes.com/">Data Science Live Book</a>.<br>
It contains some good practices we follow related to deploying models on production, dealing with missing data, handling outliers, and more.</p>
<h2 id="doesrscale">Does R scale?</h2>
<p>One of the key points of <a href="https://dbplyr.tidyverse.org/">dplyr</a> is that it can run on databases, thanks to another package with a pretty similar name: dbplyr.</p>
<p>This way, we write R syntax (<code>dplyr</code>) and it is &quot;automagically&quot; converted to SQL syntax, which then runs on production.</p>
<p>There are some cases in which the conversion from R to SQL is not done automatically. For such cases, we can still mix raw SQL into our R code.</p>
<p>For example, the following dplyr syntax:</p>
<pre><code class="language-r">flights %&gt;%
  group_by(month, day) %&gt;%
  summarise(delay = mean(dep_delay))
</code></pre>
<p>Generates:</p>
<pre><code class="language-sql">SELECT `month`, `day`, AVG(`dep_delay`) AS `delay`
FROM `nycflights13::flights`
GROUP BY `month`, `day`
</code></pre>
<p>This way, <code>dbplyr</code> makes it transparent to the R user whether they are working with objects in RAM or in a remote database.</p>
<p>Not many people know it, but many key pieces of the R ecosystem are written in C++ (concretely, through the <a href="http://adv-r.had.co.nz/Rcpp.html">Rcpp</a> package).</p>
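<p>To see the translation without connecting to a real database, <code>dbplyr</code> offers simulated connections. The following is a sketch under assumptions: the lazy table only mimics three columns of <code>nycflights13::flights</code>, and the generic simulated backend is used, so the exact SQL (quoting, function names) may differ on Amazon Redshift:</p>
<pre><code class="language-r">library(dplyr)
library(dbplyr)

# A lazy table backed by a simulated connection: no real database is needed.
# Assumption: the column names mimic nycflights13::flights.
flights_db = lazy_frame(month = 1L, day = 1L, dep_delay = 10, con = simulate_dbi())

delay_query = flights_db %&gt;%
  group_by(month, day) %&gt;%
  summarise(delay = mean(dep_delay, na.rm = TRUE))

# Print the SQL that dbplyr would send to the database
show_query(delay_query)

# Mixing raw SQL into dplyr code with sql(), for cases without automatic translation
flights_db %&gt;%
  mutate(delay_hours = sql(&quot;dep_delay / 60.0&quot;)) %&gt;%
  show_query()
</code></pre>
<p>Nothing is computed until the query is collected, which is what lets the heavy work happen in the database instead of in RAM.</p>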
<h2 id="howdowesharetheresults">How do we share the results?</h2>
<p>Mostly in <a href="https://www.tableau.com/">Tableau</a>. We have some integrations with <a href="https://www.salesforce.com">Salesforce</a>.</p>
<p>In addition, we have some reports deployed in <a href="https://shiny.rstudio.com">Shiny</a>, especially the ones that need complex customer interaction.<br>
Shiny allows custom reports to be built using simple R code, without having to learn JavaScript, Python or other frontend and backend languages. Through its &quot;reactive&quot; interface, the user can input parameters that the Shiny application reacts to by redrawing the report. In contrast with tools like Tableau, Domo, PowerBI, etc., which are more &quot;drag and drop&quot;, the programmatic nature of Shiny apps allows them to do almost anything the developer can imagine, which might be more difficult or impossible in other tools.</p>
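<p>A minimal sketch of that reactive idea, using the built-in <code>faithful</code> dataset (this toy app is just an illustration, not one of our production reports):</p>
<pre><code class="language-r">library(shiny)

# UI: one input (a slider) and one output (a plot)
ui = fluidPage(
  sliderInput(&quot;bins&quot;, &quot;Number of bins:&quot;, min = 5, max = 50, value = 20),
  plotOutput(&quot;hist&quot;)
)

# Server: the plot is redrawn every time input$bins changes
server = function(input, output) {
  output$hist = renderPlot({
    hist(faithful$waiting, breaks = input$bins, main = &quot;Waiting time between eruptions&quot;)
  })
}

app = shinyApp(ui, server)
# runApp(app)  # uncomment to launch the app locally
</code></pre>
<p>There is no explicit event handling: because <code>renderPlot</code> reads <code>input$bins</code>, Shiny knows it has to re-run that code when the slider moves.</p>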
<p><img src="https://blog.datascienceheroes.com/content/images/2019/12/rmarkdown.png" alt="How Auth0’s Data Team uses R and Python"></p>
<p>For ad hoc reports (HTML), we use <a href="https://rmarkdown.rstudio.com">R Markdown</a>, which shares some functionality with Jupyter notebooks. It allows us to write a script with an analysis that ends up as a dashboard, a PDF report, a web-based report, or even a book!</p>
<h2 id="machinelearningai">Machine Learning / AI</h2>
<p>We use both R and Python.</p>
<p>For Machine Learning projects, we mainly use the <a href="https://topepo.github.io/caret/index.html">caret</a> package in R. It provides a high-level interface to many machine learning algorithms, as well as to common tasks in data preparation, model evaluation, and hyperparameter tuning.</p>
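<p>As a small sketch of that high-level interface, here is a decision tree tuned with 5-fold cross-validation on the built-in <code>iris</code> data. The dataset, the algorithm and the tuning settings are arbitrary choices for the example, not what we run in production:</p>
<pre><code class="language-r">library(caret)

set.seed(123)

# 5-fold cross-validation as the resampling scheme
ctrl = trainControl(method = &quot;cv&quot;, number = 5)

# Same interface for many algorithms: change `method` to switch models
fit = train(Species ~ ., data = iris, method = &quot;rpart&quot;,
            trControl = ctrl, tuneLength = 3)

fit                       # resampling results for each candidate value of cp
predict(fit, head(iris))  # predictions from the final model
</code></pre>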
<p>For Deep Learning, we use Python, specifically <a href="https://keras.io">Keras</a> with <a href="https://www.tensorflow.org/">TensorFlow</a> as the backend.<br>
Keras is an API that lets us build many of the most complex neural networks with just a few lines of code. Training can easily be scaled by running it in the cloud, on services like AWS.</p>
<p>Nowadays we are also doing some experiments with <a href="https://www.fast.ai/">the fastai</a> library for NLP problems.</p>
<h2 id="summingup">Summing up!</h2>
<p>The open-source languages are leading the data path. R and Python have strong communities, and there are free and top-notch resources to learn.</p>
<p>Here we wanted to share the not-so-common approach of using R for data engineering tasks and our favorite R and Python libraries, with a focus on sharing results and explaining some of the practices we follow every day.</p>
<p>We think the most important stages in a data project are the data analysis and data preparation. Choosing the right approach can save a lot of time and help the project scale.</p>
<p>We hope this post encourages you to try some of the suggested technologies and rock your data projects!</p>
<p>--</p>
<p>Any questions? Leave them in the comments 📨</p>
</div>]]></content:encoded></item><item><title><![CDATA[Automatic data types checking in predictive models]]></title><description><![CDATA[We have data, and we need to create models (xgboost, random forest, regression, etc.). Each one of them has its own constraints regarding data types. 
Error messages are often unclear, so here's a new function to speed up model creation.]]></description><link>https://blog.datascienceheroes.com/automatic-data-types-checking-in-predictive-models/</link><guid isPermaLink="false">5d9ce53142cada04ad1620ba</guid><category><![CDATA[data cleaning]]></category><category><![CDATA[data preparation]]></category><category><![CDATA[R]]></category><dc:creator><![CDATA[Pablo Casas]]></dc:creator><pubDate>Mon, 14 Oct 2019 14:50:57 GMT</pubDate><media:content url="https://blog.datascienceheroes.com/content/images/2019/10/asorted-1.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.datascienceheroes.com/content/images/2019/10/asorted-1.png" alt="Automatic data types checking in predictive models"><p><strong>The problem</strong>: We have data, and we need to create models (xgboost, random forest, regression, etc.). Each one of them has its own constraints regarding data types.<br>
Many <em>strange</em> errors appear when we are creating models just because of the data format.</p>
<p>The new version of <code>funModeling</code>, 1.9.3 (Oct 2019), aims to provide quick and clean assistance with this.</p>
<p><em>Cover photo by: @<a href="https://unsplash.com/@franjacquier">franjacquier_</a></em></p>
<h2 id="tldrcode">tl;dr;code 💻</h2>
<p>Based on some <em>messy</em> data, we want to run a random forest, so before getting some weird errors, we can check...</p>
<p>Example 1:</p>
<pre><code class="language-r"># install.packages(&quot;funModeling&quot;)
library(funModeling)
library(tidyverse)

# Load data
data=read_delim(&quot;https://raw.githubusercontent.com/pablo14/data-integrity/master/messy_data.txt&quot;, delim = ';')

# Call the function:
integ_mod_1=data_integrity_model(data = data, model_name = &quot;randomForest&quot;)

# Any errors?
integ_mod_1
</code></pre>
<pre><code>## 
## ✖ {NA detected} num_vessels_flour, thal, gender
## ✖ {Character detected} gender, has_heart_disease
## ✖ {One unique value} constant
</code></pre>
<p>Regardless of the &quot;one unique value&quot; warning, the other errors need to be solved in order to create a random forest.</p>
<img src="https://blog.datascienceheroes.com/content/images/2019/09/giphy2.gif" width="200px" alt="Automatic data types checking in predictive models">
<p>Algorithms have their own data type restrictions and their own error messages, which turns execution into a hard debugging task... <code>data_integrity_model</code> reports such problems with a common error message.</p>
<h2 id="introduction">Introduction</h2>
<p><code>data_integrity_model</code> is built on top of the <code>data_integrity</code> function. We talked about it in the post: <a href="https://blog.datascienceheroes.com/fast-data-exploration-for-predictive-modeling/">Fast data exploration for predictive modeling</a>.</p>
<p>It checks:</p>
<ul>
<li><code>NA</code></li>
<li>Data types (allow non-numeric? allow character?)</li>
<li>High cardinality</li>
<li>One unique value</li>
</ul>
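<p>To get a feeling for what these checks look at, here is a minimal base R sketch on a made-up data frame. It only illustrates the idea and is not the code <code>funModeling</code> runs internally; the data frame and the variable names are invented for the example:</p>
<pre><code class="language-r"># Toy data frame, invented for this example
df = data.frame(
  id = 1:4,
  gender = c('m', 'f', NA, 'f'),
  constant = 1,
  stringsAsFactors = FALSE
)

# Columns with at least one NA
vars_na = names(df)[sapply(df, anyNA)]

# Character columns (not allowed by some models)
vars_char = names(df)[sapply(df, is.character)]

# Columns with a single unique non-NA value
n_unique = sapply(df, function(x) length(unique(x[!is.na(x)])))
vars_one_value = names(n_unique)[n_unique == 1]

vars_na         # gender
vars_char       # gender
vars_one_value  # constant
</code></pre>
<p>High cardinality follows the same logic: the count of unique values is compared against a threshold.</p>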
<h2 id="supportedmodels">Supported models 🤖</h2>
<p>It takes the metadata from a table that is pre-loaded with <code>funModeling</code>:</p>
<pre><code class="language-r">head(metadata_models)
</code></pre>
<pre><code>## # A tibble: 6 x 6
##   name         allow_NA max_unique allow_factor allow_character only_numeric
##   &lt;chr&gt;        &lt;lgl&gt;         &lt;dbl&gt; &lt;lgl&gt;        &lt;lgl&gt;           &lt;lgl&gt;       
## 1 randomForest FALSE            53 TRUE         FALSE           FALSE       
## 2 xgboost      TRUE            Inf FALSE        FALSE           TRUE        
## 3 num_no_na    FALSE           Inf FALSE        FALSE           TRUE        
## 4 no_na        FALSE           Inf TRUE         TRUE            TRUE        
## 5 kmeans       FALSE           Inf TRUE         TRUE            TRUE        
## 6 hclust       FALSE           Inf TRUE         TRUE            TRUE
</code></pre>
<p>The idea is that anyone can add the most popular models, or some configuration that is not there yet.<br>
There are some redundancies, but the purpose is to focus on the model, not on the metadata it needs.<br>
This way we don't have to remember the <code>no NA</code> constraint of random forest; we just write <code>randomForest</code>.</p>
<p>Some custom configurations:</p>
<ul>
<li><code>no_na</code>: no NA variables.</li>
<li><code>num_no_na</code>: numeric with no NA (for example, useful when doing deep learning).</li>
</ul>
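<p>To see what a given name implies, we can filter the table. This is a small sketch that assumes the columns and values shown in the output above (they may differ in other versions of the package):</p>
<pre><code class="language-r">library(funModeling)

# Constraints behind the 'randomForest' configuration
rf_config = subset(metadata_models, name == 'randomForest')

rf_config$allow_NA         # FALSE, per the table above
rf_config$allow_character  # FALSE, per the table above
rf_config$max_unique       # 53, per the table above
</code></pre>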
<h2 id="embedinadataflowonproduction">Embed in a data flow on production 🚚</h2>
<p>Many people ask typical questions when interviewing candidates. I like these ones: <em>&quot;How do you deal with new data?&quot;</em> or <em>&quot;What are the considerations you have when you do a deploy?&quot;</em></p>
<p>Based on our first example:</p>
<pre><code class="language-r">integ_mod_1
</code></pre>
<pre><code>## 
## ✖ {NA detected} num_vessels_flour, thal, gender
## ✖ {Character detected} gender, has_heart_disease
## ✖ {One unique value} constant
</code></pre>
<p>We can check:</p>
<pre><code class="language-r">integ_mod_1$data_ok
</code></pre>
<pre><code>## [1] FALSE
</code></pre>
<p><code>data_ok</code> is a flag we can use to stop a process, raising an error, if anything goes wrong.</p>
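<p>One possible way to wire the flag into a data flow is sketched below. The helper <code>check_or_stop</code> is hypothetical (it is not part of <code>funModeling</code>); it only wraps <code>data_integrity_model</code> and calls <code>stop()</code> when <code>data_ok</code> is <code>FALSE</code>:</p>
<pre><code class="language-r">library(funModeling)

# Hypothetical helper, not part of funModeling
check_or_stop = function(data, model_name) {
  integ = data_integrity_model(data = data, model_name = model_name)
  if (!integ$data_ok) {
    print(integ)  # show which variables failed
    stop('Data integrity check failed for model_name = ', model_name)
  }
  invisible(integ)
}

# mtcars has no NA values, so the flow continues
integ_ok = check_or_stop(mtcars, 'no_na')
</code></pre>
<p>In a scheduled job, the error raised by <code>stop()</code> prevents the model from being trained or scored on data it cannot handle.</p>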
<h2 id="moreexamples">More examples 🎁</h2>
<p>Example 2:</p>
<p>On the <code>mtcars</code> data frame, check if there is any variable with <code>NA</code>:</p>
<pre><code class="language-r">di2=data_integrity_model(data = mtcars, model_name = &quot;no_na&quot;)

# Check:
di2
</code></pre>
<pre><code>## ✔ Data model integrity ok!
</code></pre>
<p>Good to go?</p>
<pre><code class="language-r">di2$data_ok
</code></pre>
<pre><code>## [1] TRUE
</code></pre>
<p>Example 3:</p>
<pre><code class="language-r">data_integrity_model(data = heart_disease, model_name = &quot;pca&quot;)
</code></pre>
<pre><code>## 
## ✖ {NA detected} num_vessels_flour, thal
## ✖ {Non-numeric detected} gender, chest_pain, fasting_blood_sugar, resting_electro, thal, exter_angina, has_heart_disease
</code></pre>
<p>Example 4:</p>
<pre><code class="language-r">data_integrity_model(data = iris, model_name = &quot;kmeans&quot;)
</code></pre>
<pre><code>## 
## ✖ {Non-numeric detected} Species
</code></pre>
<h2 id="anysuggestions">Any suggestions?</h2>
<p>If you come across any cases which aren't covered here, you are welcome to contribute: <a href="https://github.com/pablo14/funModeling">funModeling's github</a>.</p>
<p>How about time series? I treated them as numeric with no NA (<code>model_name = &quot;num_no_na&quot;</code>). You can add any new model by updating the table <code>metadata_models</code>.</p>
<p>And that's it.</p>
<hr>
<p>In case you want to understand more about data types and quality, you can check the <em><a href="https://livebook.datascienceheroes.com/">Data Science Live Book</a></em> 📗</p>
<p>Have data fun! 🚀</p>
<p>📬 You can find me on: <a href="https://www.linkedin.com/in/pcasas/">Linkedin</a> &amp; <a href="https://twitter.com/pabloc_ds">Twitter</a>.</p>
</div>]]></content:encoded></item><item><title><![CDATA[Fast data exploration for predictive modeling]]></title><description><![CDATA[Before predictive model creation, we need to check/change numerical, categorical, NAs, one unique value and high cardinality variables. This new function will assist us in this task.]]></description><link>https://blog.datascienceheroes.com/fast-data-exploration-for-predictive-modeling/</link><guid isPermaLink="false">5d81300942cada04ad1620b1</guid><category><![CDATA[data preparation]]></category><category><![CDATA[exploratory data analysis]]></category><category><![CDATA[R]]></category><dc:creator><![CDATA[Pablo Casas]]></dc:creator><pubDate>Wed, 18 Sep 2019 15:42:29 GMT</pubDate><media:content url="https://blog.datascienceheroes.com/content/images/2019/09/giphy.gif" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.datascienceheroes.com/content/images/2019/09/giphy.gif" alt="Fast data exploration for predictive modeling"><p><strong>The problem</strong>: Before modeling, we need to check/change numerical, categorical, NAs, one unique value and high cardinality variables.</p>
<p>The new version of <code>funModeling</code>, 1.9.2, was released to assist with the step prior to creating machine learning models.</p>
<p><em>This post continues in <a href="https://blog.datascienceheroes.com/automatic-data-types-checking-in-predictive-models/">Automatic data types checking in predictive models</a></em></p>
<h2 id="introduction">Introduction</h2>
<p>The <code>data_integrity</code> function provides information about the format of all the variables, as well as some short stats about <code>NA</code> values.</p>
<p>This way we can select and transform the variables, keeping them in the format we need.</p>
<pre><code class="language-r"># install.packages(&quot;funModeling&quot;)
library(funModeling)
</code></pre>
<h2 id="loadthemessydata">Load the <em>messy</em> data:</h2>
<img src="https://blog.datascienceheroes.com/content/images/2019/09/messi.png" width="200px" alt="Fast data exploration for predictive modeling">
<pre><code class="language-r">library(tidyverse)
data=read_delim(&quot;https://raw.githubusercontent.com/pablo14/data-integrity/master/messy_data.txt&quot;, delim = ';')
</code></pre>
<p>Now we call the <code>data_integrity</code> function, which returns an <code>integrity</code> object:</p>
<pre><code class="language-r">di=data_integrity(data)
</code></pre>
<p>Then, the <code>summary</code> function gives us a quick, self-explanatory overview:</p>
<pre><code class="language-r">summary(di)
</code></pre>
<pre><code>## 
## ◌ {Numerical with NA} num_vessels_flour, thal
## ◌ {Categorical with NA} gender
## ● {One unique value} constant
</code></pre>
<p>Now we can use <code>mutate_at</code>, <code>select</code>, or any other function on specific columns.</p>
<p>In case we need the variable names as a vector of strings, we can use the RStudio bare-combine add-in:</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">My keyboard shortcut for this lil&#39; function gets quite the workout…<br>📺 &quot;hrbraddins::bare_combine()&quot; by <a href="https://twitter.com/hrbrmstr?ref_src=twsrc%5Etfw">@hrbrmstr</a> <a href="https://t.co/8dwqNEso0B">https://t.co/8dwqNEso0B</a> <a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a> <a href="https://t.co/gyqz2mUE0Y">pic.twitter.com/gyqz2mUE0Y</a></p>&mdash; Mara Averick (@dataandme) <a href="https://twitter.com/dataandme/status/1155842512743030785?ref_src=twsrc%5Etfw">July 29, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 
<p>The high cardinality threshold can be changed using the parameter <code>MAX_UNIQUE</code>.</p>
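<p>As a quick sketch of that parameter, we can lower the threshold on <code>iris</code> (a data set chosen here only for illustration) so that the 3-level <code>Species</code> factor is flagged. It assumes a variable is flagged when its number of unique values exceeds <code>MAX_UNIQUE</code>, and it reads the <code>results</code> element described in the next section:</p>
<pre><code class="language-r">library(funModeling)

# Lower the threshold so a 3-level factor counts as high cardinality
di_iris = data_integrity(iris, MAX_UNIQUE = 2)

# Threshold stored in the integrity object
di_iris$results$MAX_UNIQUE

# Categorical variables above the threshold
di_iris$results$vars_cat_high_card
</code></pre>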
<h2 id="accessingalltheinformation">Accessing all the information</h2>
<p>If we print the integrity object, we can see a lot of information regarding <code>NA</code>, numerical, categorical and other types, alongside the high cardinality variables:</p>
<pre><code class="language-r">di
</code></pre>
<pre><code>## $vars_num_with_NA
##            variable q_na       p_na
## 1 num_vessels_flour    4 0.01320132
## 2              thal    2 0.00660066
## 
## $vars_cat_with_NA
##   variable q_na       p_na
## 1   gender    1 0.00330033
## 
## $vars_cat_high_card
## [1] variable unique  
## &lt;0 rows&gt; (or 0-length row.names)
## 
## $MAX_UNIQUE
## [1] 35
## 
## $vars_one_value
## [1] &quot;constant&quot;
## 
## $vars_cat
## [1] &quot;gender&quot;            &quot;has_heart_disease&quot;
## 
## $vars_num
##  [1] &quot;age&quot;                    &quot;chest_pain&quot;             &quot;resting_blood_pressure&quot;
##  [4] &quot;serum_cholestoral&quot;      &quot;fasting_blood_sugar&quot;    &quot;resting_electro&quot;       
##  [7] &quot;max_heart_rate&quot;         &quot;exer_angina&quot;            &quot;oldpeak&quot;               
## [10] &quot;slope&quot;                  &quot;num_vessels_flour&quot;      &quot;thal&quot;                  
## [13] &quot;heart_disease_severity&quot; &quot;exter_angina&quot;           &quot;constant&quot;              
## [16] &quot;id&quot;                    
## 
## $vars_char
## [1] &quot;gender&quot;            &quot;has_heart_disease&quot;
## 
## $vars_factor
## character(0)
## 
## $vars_other
## [1] &quot;has_heart_disease2&quot; &quot;fecha&quot;              &quot;fecha2&quot;
</code></pre>
<p>Each element can be accessed directly to operate on it quickly:</p>
<pre><code class="language-r">di$results$vars_num
</code></pre>
<pre><code>##  [1] &quot;age&quot;                    &quot;chest_pain&quot;             &quot;resting_blood_pressure&quot;
##  [4] &quot;serum_cholestoral&quot;      &quot;fasting_blood_sugar&quot;    &quot;resting_electro&quot;       
##  [7] &quot;max_heart_rate&quot;         &quot;exer_angina&quot;            &quot;oldpeak&quot;               
## [10] &quot;slope&quot;                  &quot;num_vessels_flour&quot;      &quot;thal&quot;                  
## [13] &quot;heart_disease_severity&quot; &quot;exter_angina&quot;           &quot;constant&quot;              
## [16] &quot;id&quot;
</code></pre>
<p>Numerical variables with <code>NA</code> values:</p>
<pre><code class="language-r">di$results$vars_num_with_NA$variable
</code></pre>
<pre><code>## [1] &quot;num_vessels_flour&quot; &quot;thal&quot;
</code></pre>
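<p>These vectors can drive the data preparation itself. The sketch below uses the <code>heart_disease</code> data that ships with <code>funModeling</code> and fills the <code>NA</code> values of the numeric variables with the column median. Median imputation is only one possible treatment, picked here for illustration:</p>
<pre><code class="language-r">library(funModeling)

di_hd = data_integrity(heart_disease)

# Names of the numeric variables that contain NA
vars_fix = as.character(di_hd$results$vars_num_with_NA$variable)

# Replace NA with the column median, only in those columns
hd_fixed = heart_disease
for (v in vars_fix) {
  col_median = median(hd_fixed[[v]], na.rm = TRUE)
  hd_fixed[[v]][is.na(hd_fixed[[v]])] = col_median
}

# Check that no NA is left in the treated columns
anyNA(hd_fixed[vars_fix])
</code></pre>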
<img src="https://blog.datascienceheroes.com/content/images/2019/09/giphy2.gif" width="200px" alt="Fast data exploration for predictive modeling">
<p>Help page:</p>
<pre><code class="language-r">help(&quot;data_integrity&quot;)
</code></pre>
<h1 id="newstatusfunction">New <code>status</code> function</h1>
<p>This is the internal function used in <code>data_integrity</code>:</p>
<pre><code class="language-r">status(heart_disease)
</code></pre>
<pre><code>##                  variable q_zeros   p_zeros q_na       p_na q_inf p_inf    type unique
## 1                     age       0 0.0000000    0 0.00000000     0     0 integer     41
## 2                  gender       0 0.0000000    0 0.00000000     0     0  factor      2
## 3              chest_pain       0 0.0000000    0 0.00000000     0     0  factor      4
## 4  resting_blood_pressure       0 0.0000000    0 0.00000000     0     0 integer     50
## 5       serum_cholestoral       0 0.0000000    0 0.00000000     0     0 integer    152
## 6     fasting_blood_sugar     258 0.8514851    0 0.00000000     0     0  factor      2
## 7         resting_electro     151 0.4983498    0 0.00000000     0     0  factor      3
## 8          max_heart_rate       0 0.0000000    0 0.00000000     0     0 integer     91
## 9             exer_angina     204 0.6732673    0 0.00000000     0     0 integer      2
## 10                oldpeak      99 0.3267327    0 0.00000000     0     0 numeric     40
## 11                  slope       0 0.0000000    0 0.00000000     0     0 integer      3
## 12      num_vessels_flour     176 0.5808581    4 0.01320132     0     0 integer      4
## 13                   thal       0 0.0000000    2 0.00660066     0     0  factor      3
## 14 heart_disease_severity     164 0.5412541    0 0.00000000     0     0 integer      5
## 15           exter_angina     204 0.6732673    0 0.00000000     0     0  factor      2
## 16      has_heart_disease       0 0.0000000    0 0.00000000     0     0  factor      2
</code></pre>
<p>It's another version of <code>df_status</code>, where percentages are expressed in the range of 0 to 1 (not 0 to 100), which is more intuitive to use in filters.</p>
<p>This is the same object as <code>di$status_now</code>.</p>
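<p>As an example of such filters, the following sketch picks variables straight from the <code>status</code> table. The 0.5 cut-off is an arbitrary choice for the example:</p>
<pre><code class="language-r">library(funModeling)

st = status(heart_disease)

# Variables where more than half of the rows are zero
vars_many_zeros = st$variable[st$p_zeros &gt; 0.5]

# Variables with any missing value
vars_with_na = st$variable[st$p_na != 0]

vars_many_zeros
vars_with_na
</code></pre>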
<h2 id="nextrealase">Next release?</h2>
<p>It will contain, based on <code>data_integrity</code>, an automated data quality test suited for the predictive model we need to run.<br>
I find this task quite important and repetitive when I teach. Hopefully it will save some time!</p>
<h2 id="furtherreading">Further reading</h2>
<p>All of these topics are covered in depth in the <em>Data Science Live Book</em> 📗:</p>
<ul>
<li><a href="https://livebook.datascienceheroes.com/exploratory-data-analysis.html#profiling">Dataset status</a></li>
<li><a href="https://livebook.datascienceheroes.com/data-preparation.html#data_types">Data types in predictive modeling</a></li>
<li><a href="https://livebook.datascienceheroes.com/data-preparation.html#high_cardinality_predictive_modeling">High cardinality variables</a></li>
<li><a href="https://livebook.datascienceheroes.com/data-preparation.html#missing_data">Handling Missing data</a></li>
</ul>
<hr>
<p>Have fun! 🚀</p>
<p>📬 You can find me on: <a href="https://www.linkedin.com/in/pcasas/">Linkedin</a> &amp; <a href="https://twitter.com/pabloc_ds">Twitter</a>.</p>
</div>]]></content:encoded></item><item><title><![CDATA[How to use `recipes` package from `tidymodels` for one hot encoding 🛠]]></title><description><![CDATA[Quick introduction to `recipes` package, from the `tidymodels` family, based on one hot encoding. 
Useful to automate some data preparation tasks.]]></description><link>https://blog.datascienceheroes.com/how-to-use-recipes-package-for-one-hot-encoding/</link><guid isPermaLink="false">5d20a04342cada04ad16209f</guid><category><![CDATA[machine learning]]></category><category><![CDATA[data preparation]]></category><category><![CDATA[recipes]]></category><category><![CDATA[R]]></category><dc:creator><![CDATA[Pablo Casas]]></dc:creator><pubDate>Mon, 08 Jul 2019 16:52:43 GMT</pubDate><media:content url="https://blog.datascienceheroes.com/content/images/2019/07/one-hot-recipes-1.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.datascienceheroes.com/content/images/2019/07/one-hot-recipes-1.png" alt="How to use `recipes` package from `tidymodels` for one hot encoding 🛠"><p>Since one of the best ways to learn is to explain, I want to share with you this quick introduction to the <code>recipes</code> package, from the <code>tidymodels</code> family.<br>
It can help us automate some data preparation tasks.</p>
<p>The overview is:</p>
<ul>
<li>How to create a <code>recipe</code></li>
<li>How to add a <code>step</code></li>
<li>How to do the <code>prep</code></li>
<li>Getting the data with <code>juice</code>!</li>
<li>Apply the prep to new data</li>
<li>What is the difference between <code>bake</code> and <code>juice</code>?</li>
<li>Dealing with new values in recipes (<code>step_novel</code>)</li>
</ul>
<p>Since I'm new to this package, if you have something to add just put it in the comments ;)</p>
<h2 id="introduction">Introduction</h2>
<p>If you are new to R or you do one-off analyses, you might not see the main advantage of this, which is, in my opinion, having most of the <strong>data preparation</strong> steps in one place. This way it is easier to split between dev and prod.</p>
<ul>
<li>Dev: The stage in which we create the model</li>
<li>Prod: The moment in which we run the model with new data</li>
</ul>
<p>The other big advantage is it follows the <em>tidy</em> philosophy, so many things will be familiar.</p>
<h2 id="howtouserecipesforonehotencoding">How to use <code>recipes</code> for one hot encoding</h2>
<p>This post focuses on <strong>one hot encoding</strong>, but many other operations like scaling, applying PCA and others can be performed.</p>
<p>But first, <strong>what is one hot encoding?</strong></p>
<p>It's a data preparation technique to convert all the categorical variables into numerical ones, by creating one column per category and assigning a value of <code>1</code> when the row belongs to that category (and <code>0</code> otherwise). If the variable has 100 unique values, the final result will contain 100 columns.</p>
<img src="https://blog.datascienceheroes.com/content/images/2019/07/one-hot-encoding.png" width="300px" alt="How to use `recipes` package from `tidymodels` for one hot encoding 🛠">
<p>That's why it is a <strong>good practice to reduce the cardinality</strong> of the variable before continuing. Learn more about it in the <a href="https://livebook.datascienceheroes.com/data-preparation.html#high_cardinality_predictive_modeling">High Cardinality Variable in Predictive Modeling</a> chapter from the <em>Data Science Live Book</em> 📗.</p>
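<p>Before moving to <code>recipes</code>, here is the definition above as a tiny base R sketch. The <code>color</code> variable is made up for the example, and this is not how <code>recipes</code> implements it:</p>
<pre><code class="language-r"># Made-up categorical variable
color = factor(c('red', 'green', 'red', 'blue'))

# One column per level: 1 when the row belongs to that level, 0 otherwise
one_hot = sapply(levels(color), function(l) as.integer(color == l))

one_hot
</code></pre>
<pre><code class="language-r">##      blue green red
## [1,]    0     0   1
## [2,]    0     1   0
## [3,]    0     0   1
## [4,]    1     0   0
</code></pre>
<p>Three unique values produce three columns, and every row has exactly one <code>1</code>.</p>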
<p>Let's start the example with recipes!</p>
<h3 id="1sthowtocreatearecipe">1st - How to create a <code>recipe</code></h3>
<pre><code class="language-r">library(recipes)
library(tidyverse)

set.seed(3.1415)
iris_tr=sample_frac(iris, size = 0.7)

rec = recipe( ~ ., data = iris_tr)

rec
</code></pre>
<pre><code class="language-r">## Data Recipe
## 
## Inputs:
## 
##       role #variables
##  predictor          5
</code></pre>
<pre><code class="language-r">summary(rec)
</code></pre>
<pre><code class="language-r">## # A tibble: 5 x 4
##   variable     type    role      source  
##   &lt;chr&gt;        &lt;chr&gt;   &lt;chr&gt;     &lt;chr&gt;   
## 1 Sepal.Length numeric predictor original
## 2 Sepal.Width  numeric predictor original
## 3 Petal.Length numeric predictor original
## 4 Petal.Width  numeric predictor original
## 5 Species      nominal predictor original
</code></pre>
<p>The formula <code>~ .</code> specifies that all the variables are predictors (with no outcomes).</p>
<p>Please note that we now have two different data types, numeric and nominal (neither factor nor character).</p>
<h3 id="2ndhowtoaddastep">2nd - How to add a step</h3>
<p>Now we add the step to create the dummy variables, or the <strong>one hot encoding</strong>, which can be seen as the same.</p>
<p>When we do the one hot encoding (<code>one_hot = T</code>), all the levels will be present in the final result. Conversely, when we create the dummy variables, we could have all of the levels, or one less (to avoid the multicollinearity issue).</p>
<pre><code class="language-r">rec_2 = rec %&gt;% step_dummy(Species, one_hot = T)

rec_2
</code></pre>
<pre><code class="language-r">## Data Recipe
## 
## Inputs:
## 
##       role #variables
##  predictor          5
## 
## Operations:
## 
## Dummy variables from Species
</code></pre>
<p>Now we see the dummy step.</p>
<h3 id="3rdhowtodotheprep">3rd - How to do the <code>prep</code></h3>
<p><code>prep</code> is like putting all the ingredients together, <em>but we didn't cook yet!</em></p>
<p>It generates the metadata to do the data preparation.</p>
<p>As we can see here:</p>
<pre><code class="language-r"># Apply the recipe, which has 1 step, to the data
d_prep=rec_2 %&gt;% prep(training = iris_tr, retain = T)

d_prep
</code></pre>
<pre><code class="language-r">## Data Recipe
## 
## Inputs:
## 
##       role #variables
##  predictor          5
## 
## Training data contained 105 data points and no missing data.
## 
## Operations:
## 
## Dummy variables from Species [trained]
</code></pre>
<p>Note we are in the &quot;training&quot; or <em>dev</em> stage. That's why we see the parameter <code>training</code>.</p>
<p>We will see <code>retain = T</code> in the next step.</p>
<p>Checking:</p>
<pre><code class="language-r">summary(d_prep)
</code></pre>
<pre><code class="language-r">## # A tibble: 7 x 4
##   variable           type    role      source  
##   &lt;chr&gt;              &lt;chr&gt;   &lt;chr&gt;     &lt;chr&gt;   
## 1 Sepal.Length       numeric predictor original
## 2 Sepal.Width        numeric predictor original
## 3 Petal.Length       numeric predictor original
## 4 Petal.Width        numeric predictor original
## 5 Species_setosa     numeric predictor derived 
## 6 Species_versicolor numeric predictor derived 
## 7 Species_virginica  numeric predictor derived
</code></pre>
<p><strong>Voilà!</strong> 🎉 We have the three new <code>derived</code> columns (one hot), and it removed the original <code>Species</code>.</p>
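<p>To see the difference between one hot encoding and plain dummy variables mentioned in the 2nd step, we can compare the variables each option produces. This is a small sketch on the full <code>iris</code> data; the object names are made up for the example:</p>
<pre><code class="language-r">library(recipes)

rec_base = recipe(~ ., data = iris)

# one_hot = TRUE keeps one column per level
rec_onehot = step_dummy(rec_base, Species, one_hot = TRUE)
vars_onehot = summary(prep(rec_onehot, training = iris))$variable

# one_hot = FALSE (the default) drops the first level as the reference
rec_dummy = step_dummy(rec_base, Species, one_hot = FALSE)
vars_dummy = summary(prep(rec_dummy, training = iris))$variable

# Column that only exists in the one hot version
setdiff(vars_onehot, vars_dummy)
</code></pre>
<pre><code class="language-r">## [1] &quot;Species_setosa&quot;
</code></pre>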
<h3 id="4thgettingthedatawithjuice">4th - Getting the data with <code>juice</code>!</h3>
<p>Using <code>juice</code> function:</p>
<pre><code class="language-r">d2=juice(d_prep)

head(d2)
</code></pre>
<pre><code class="language-r">## # A tibble: 6 x 7
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species_setosa
##          &lt;dbl&gt;       &lt;dbl&gt;        &lt;dbl&gt;       &lt;dbl&gt;          &lt;dbl&gt;
## 1          5           3            1.6         0.2              1
## 2          6.9         3.2          5.7         2.3              0
## 3          6.3         3.3          4.7         1.6              0
## 4          5.3         3.7          1.5         0.2              1
## 5          6.3         2.3          4.4         1.3              0
## 6          6.7         3            5.2         2.3              0
## # … with 2 more variables: Species_versicolor &lt;dbl&gt;,
## #   Species_virginica &lt;dbl&gt;
</code></pre>
<p><code>juice</code> worked because we <em>retained</em> the training data in the 3rd step (<code>retain = T</code>). Otherwise it would have returned:</p>
<p>⚠️ <em>Error: Use <code>retain = TRUE</code> in <code>prep</code> to be able to extract the training set</em></p>
<h3 id="5thapplythepreptonewdata">5th - Apply the prep to new data</h3>
<p>Now imagine we have <strong>new data</strong> as follows:</p>
<pre><code class="language-r">iris_new=sample_n(iris, size = 5) # taking 5 random rows

d_baked=bake(d_prep, new_data = iris_new)

d_baked
</code></pre>
<pre><code class="language-r">## # A tibble: 5 x 7
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species_setosa
##          &lt;dbl&gt;       &lt;dbl&gt;        &lt;dbl&gt;       &lt;dbl&gt;          &lt;dbl&gt;
## 1          6.4         3.2          4.5         1.5              0
## 2          4.6         3.4          1.4         0.3              1
## 3          5.2         2.7          3.9         1.4              0
## 4          4.8         3.4          1.6         0.2              1
## 5          4.8         3            1.4         0.3              1
## # … with 2 more variables: Species_versicolor &lt;dbl&gt;,
## #   Species_virginica &lt;dbl&gt;
</code></pre>
<p>It worked!</p>
<p><code>bake</code> receives the <code>prep</code> object (<code>d_prep</code>) and applies it to the <code>new_data</code> (<code>iris_new</code>).</p>
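<p>This is where the dev/prod split from the introduction pays off. One possible way to move the prep between both stages is to save it to disk, as sketched below. Using <code>saveRDS</code>/<code>readRDS</code> and a temporary file is an assumption made for this example, not something <code>recipes</code> requires:</p>
<pre><code class="language-r">library(recipes)

# Dev: define the recipe, prep it and save the result
rec_dev = step_dummy(recipe(~ ., data = iris), Species, one_hot = TRUE)
prep_dev = prep(rec_dev, training = iris)

prep_path = tempfile(fileext = '.rds')  # hypothetical location
saveRDS(prep_dev, prep_path)

# Prod: load the prep and apply it to the incoming rows
prep_prod = readRDS(prep_path)
baked_prod = bake(prep_prod, new_data = iris[1:3, ])

names(baked_prod)
</code></pre>
<p>The production script never re-learns anything from the new rows; it only applies what was learned in dev.</p>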
<h3 id="whatisthedifferencebetweenbakeandjuice">What is the difference between <code>bake</code> and <code>juice</code>?</h3>
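<p>The difference is where the data comes from: <code>juice</code> returns the training data that was processed and retained during <code>prep</code>, while <code>bake</code> applies the prep to whatever data we pass in <code>new_data</code>. So, given the training data, the following data frames are the same:</p>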
<pre><code class="language-r">d_tr_1=bake(d_prep, new_data = iris_tr)
d_tr_2=juice(d_prep) # with retain=T

identical(d_tr_1, d_tr_2)
</code></pre>
<pre><code class="language-r">## [1] TRUE
</code></pre>
<h2 id="dealingwithnewvaluesinrecipes">Dealing with new values in recipes</h2>
<p>Simulate a new value:</p>
<pre><code class="language-r">new_row=iris[1,] %&gt;% mutate(Species=as.character(Species))
new_row[1, &quot;Species&quot;]=&quot;i will break your code&quot;

new_row
</code></pre>
<pre><code class="language-r">##   Sepal.Length Sepal.Width Petal.Length Petal.Width                Species
## 1          5.1         3.5          1.4         0.2 i will break your code
</code></pre>
<p>We use <code>bake</code> to convert the new data set:</p>
<pre><code class="language-r">d2_b=bake(d_prep, new_data = new_row)
</code></pre>
<pre><code class="language-r">## Warning: There are new levels in a factor: i will break your code
</code></pre>
<h3 id="thesolutionusestep_novel">The solution! Use <code>step_novel</code></h3>
<p>(Thanks to Max Kuhn)</p>
<p>When we define the recipe, we have to add <code>step_novel</code> before <code>step_dummy</code>. This way any new value will be assigned to the <code>new</code> category, which becomes the <code>Species_new</code> column.</p>
<p>We will start right from the beginning:</p>
<pre><code class="language-r">rec_2_bis = recipe( ~ ., data = iris_tr) %&gt;% 
  step_novel(Species) %&gt;% 
  step_dummy(Species, one_hot = T)

prep_bis = prep(rec_2_bis, training = iris_tr)
</code></pre>
<p>Get the final data and check it:</p>
<pre><code class="language-r">processed = bake(prep_bis, iris_tr)

funModeling::df_status(processed)
</code></pre>
<pre><code class="language-r">##             variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
## 1       Sepal.Length       0    0.00    0    0     0     0 numeric     32
## 2        Sepal.Width       0    0.00    0    0     0     0 numeric     23
## 3       Petal.Length       0    0.00    0    0     0     0 numeric     42
## 4        Petal.Width       0    0.00    0    0     0     0 numeric     20
## 5     Species_setosa      68   64.76    0    0     0     0 numeric      2
## 6 Species_versicolor      69   65.71    0    0     0     0 numeric      2
## 7  Species_virginica      73   69.52    0    0     0     0 numeric      2
## 8        Species_new     105  100.00    0    0     0     0 numeric      1
</code></pre>
<p>Please note that <code>Species_new</code> <strong>has been automatically created</strong> (with zeros).</p>
<p>👉 This ensures it <strong>runs well</strong> once in production.</p>
<p>Now let's see what happens when we have the new value:</p>
<pre><code class="language-r">new_row_2=bake(prep_bis, new_data = new_row)

new_row_2 %&gt;% select(Species_new)
</code></pre>
<pre><code class="language-r">## # A tibble: 1 x 1
##   Species_new
##         &lt;dbl&gt;
## 1           1
</code></pre>
<p>It works!</p>
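<p>Conceptually, what happened to the unseen value can be sketched in base R as follows. This is only an illustration of the idea, not the internals of <code>step_novel</code>:</p>
<pre><code class="language-r"># Levels seen during training
train_levels = c('setosa', 'versicolor', 'virginica')

# Incoming data with a value never seen before
incoming = c('setosa', 'i will break your code')

# Anything outside the training levels is mapped to the extra level 'new'
incoming[!(incoming %in% train_levels)] = 'new'
incoming_f = factor(incoming, levels = c(train_levels, 'new'))

table(incoming_f)
</code></pre>
<p>Because the extra level exists from the start, the dummy step always produces the same set of columns.</p>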
<h2 id="conclusions">Conclusions 💡</h2>
<p>The <code>recipes</code> package seems to be a good way to standardize certain data preparation tasks.<br>
Probably one of the strongest points in R, alongside the <code>dplyr</code> package.</p>
<p>📌 Take care of the <strong>data pipeline</strong>; it is what interviewers will ask you about.</p>
<p>In the <a href="https://livebook.datascienceheroes.com">Data Science Live Book</a> 📗 (open-source), I tried to cover, with simple and reproducible examples, many of the situations that come up when we work with production environments.</p>
<p>Have fun 🚀</p>
<p>📬 You can find me on: <a href="https://www.linkedin.com/in/pcasas/">Linkedin</a> &amp; <a href="https://twitter.com/pabloc_ds">Twitter</a>.</p>
<h3 id="references">References:</h3>
<ul>
<li><a href="https://tidymodels.github.io/recipes/articles/Simple_Example.html">Basic recipes example</a></li>
<li><a href="https://www.benjaminsorensen.me/post/modeling-with-parsnip-and-tidymodels/">Modeling with parsnip and tidymodels</a> by Benjamin Sorensen.</li>
<li><a href="https://www.rstudio.com/resources/webinars/creating-and-preprocessing-a-design-matrix-with-recipes/">Creating and Preprocessing a Design Matrix with Recipes</a> (video)</li>
</ul>
<h3 id="otherpostsyoumightlike">Other posts you might like 🤓...</h3>
<ul>
<li>🔍 <a href="https://blog.datascienceheroes.com/how-to-interpret-shap-values-in-r/">Model interpretability with SHAP</a></li>
<li>📊 <a href="https://blog.datascienceheroes.com/discretization-recursive-gain-ratio-maximization/">Supervised binning</a></li>
</ul>
</div>]]></content:encoded></item><item><title><![CDATA[Playing with dimensions: from Clustering, PCA, t-SNE... to Carl Sagan!]]></title><description><![CDATA[Explore the intersection of concepts like dimension reduction, clustering, data preparation, PCA, HDBSCAN, k-NN, SOM, deep learning... and Carl Sagan!
]]></description><link>https://blog.datascienceheroes.com/jugando-con-las-dimensiones-desde-clustering-pca-t-sne-hasta-carl-sagan/</link><guid isPermaLink="false">5cf4542342cada04ad16208e</guid><category><![CDATA[clustering]]></category><category><![CDATA[ciencia de datos]]></category><category><![CDATA[tsne]]></category><category><![CDATA[deep-learning]]></category><category><![CDATA[kmeans]]></category><category><![CDATA[r-esp]]></category><dc:creator><![CDATA[Pablo Casas]]></dc:creator><pubDate>Mon, 03 Jun 2019 13:30:13 GMT</pubDate><media:content url="https://blog.datascienceheroes.com/content/images/2019/06/cluster_analysis.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.datascienceheroes.com/content/images/2019/06/cluster_analysis.png" alt="Playing with dimensions: from Clustering, PCA, t-SNE... to Carl Sagan!"><p>👉 <strong>Update! 7/4/20</strong> The new version of this post, with improvements and comments on UMAP, is here: <a href="https://escueladedatosvivos.ai/blog/204650/jugando-con-las-dimensiones-clustering-pca-tsne-carl-sagan">https://escueladedatosvivos.ai/blog/204650/jugando-con-las-dimensiones-clustering-pca-tsne-carl-sagan</a></p>
<h3 id="jugandoconlasdimensiones">Playing with dimensions</h3>
<p>Hi! This post is an experiment that combines the output of <strong>t-SNE</strong> with two well-known clustering techniques: <strong>k-means</strong> and <strong>hierarchical</strong>. This will be the practical section, in <strong>R</strong>.</p>
<p>But this post will also explore the point where concepts such as dimensionality reduction, cluster analysis, data preparation, PCA, HDBSCAN, k-NN, SOM, deep learning... and Carl Sagan intersect!</p>
<h3 id="pcaytsne">PCA and t-SNE</h3>
<p>For those who don't know <strong>t-SNE</strong> (<a href="https://lvdmaaten.github.io/tsne/" target="blank">official site</a>), it is a projection -or dimensionality reduction- technique, similar in some respects to Principal Component Analysis (PCA), used to visualize, for example, N variables in 2.</p>
<p>When the t-SNE output is poor, Laurens van der Maaten (t-SNE's author) says:</p>
<blockquote>
<p>As a sanity check, try running PCA on your data to reduce it to two dimensions. If this also gives bad results, then maybe there is not very much nice structure in your data in the first place. If PCA works well but t-SNE doesn't, I am fairly sure you did something wrong.</p>
</blockquote>
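<p>A minimal sketch of that sanity check in base R. The built-in <code>iris</code> data is only a stand-in (an assumption for illustration, it is not the data used in this post), and <code>d_pca</code> and <code>var_explained</code> are hypothetical names:</p>
<pre><code class="language-r">## Sanity check: project the data to 2 dimensions with PCA (base R).
## iris is only a stand-in data set; replace it with your own numeric data.
x = as.matrix(iris[, 1:4])

## center and scale so that no variable dominates because of its units
pca_model = prcomp(x, center = TRUE, scale. = TRUE)

## first two principal components, comparable to the 2-D output of t-SNE
d_pca = as.data.frame(pca_model$x[, 1:2])

## share of the total variance kept by each component
var_explained = pca_model$sdev^2 / sum(pca_model$sdev^2)
round(var_explained[1:2], 2)

plot(d_pca$PC1, d_pca$PC2, col = iris$Species, pch = 19,
     xlab = 'PC1', ylab = 'PC2', main = 'PCA')
</code></pre>
<p>If the first two components keep only a small share of the variance, the 2-D PCA picture is probably hiding most of the structure, which is a reasonable moment to try t-SNE.</p>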
<p>In my experience, running PCA on dozens of variables with:</p>
<ul>
<li>Some extreme values</li>
<li>Skewed distributions</li>
<li>Several <em>dummy</em> or <em>one-hot</em> variables (0 or 1),</li>
</ul>
<p>does not lead to good visualizations.</p>
<p>Look at this example comparing the two methods:</p>
<img src="https://datascienceheroes.com/img/blog/pca_tsne.png" alt="Jugando con las dimensiones: desde Clustering, PCA, t-SNE.... ¡hasta Carl Sagan!" width="700px">
<p>Source: <a href="https://www.kaggle.com/puyokw/digit-recognizer/clustering-in-2-dimension-using-tsne/code" target="blank">Clustering in 2-dimension using tsne</a></p>
<p>Makes sense, right?</p>
<br>
<h3 id="surfeandoendimensionessuperiores">Surfing higher dimensions 🏄</h3>
<p>Since one of the <strong>t-SNE</strong> outputs is a two-dimensional matrix, where each point represents an input case, we can apply clustering and then group the cases according to their distance on this <strong>2-dimensional</strong> map. Just like a geographic map projects our 3-dimensional world onto two dimensions (paper).</p>
<p><strong>t-SNE</strong> puts similar cases together, handling non-linearities in the data very well. After using the algorithm on several data sets, I believe that in some cases it creates something like <em>circular shapes</em>, like islands, where the cases are similar.</p>
<p>However, I didn't see this effect in the interactive demo from the Google Brain team: <a href="http://distill.pub/2016/misread-tsne/" target="blank">How to Use t-SNE Effectively</a>. Perhaps because of the nature of the input data: 2 variables as input.</p>
<br> 
<h4 id="losdatosdelrollosuizoswissroll">The swiss roll data</h4>
<p>According to its FAQ, t-SNE does not work very well with the <em>swiss roll</em> toy data. However, it is a stunning example of how a 3-dimensional surface (or <strong>manifold</strong>) with a distinct spiral shape unfolds like paper thanks to a dimensionality reduction technique.</p>
<p>The image was taken from <a href="http://axon.cs.byu.edu/papers/gashler2011smc.pdf" target="blank">this paper</a>, where they used the <a href="https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction#Manifold_sculpting">&quot;manifold sculpting&quot;</a> technique.</p>
<img src="https://datascienceheroes.com/img/blog/swiss_roll_manifold_sculpting.png" alt="Jugando con las dimensiones: desde Clustering, PCA, t-SNE.... ¡hasta Carl Sagan!" width="700px">
<br>
<h3 id="ahoralaprcticaenr">Now the practice in R!</h3>
<p><strong>t-SNE</strong> helps make the clustering more accurate because it converts the data into a 2-dimensional space where the points lie in circular shapes (which suits k-means well, since non-circular shapes are one of its weak points when creating segments). More on this: <a href="http://varianceexplained.org/r/kmeans-free-lunch/" target="blank">K-means clustering is not a free lunch</a>.</p>
<p>It works as a kind of <strong>data preparation</strong> step before applying the clustering models.</p>
<pre><code class="language-r">
library(ggplot2) ## used by the plots further down
library(Rtsne)

######################################################################
## The WHOLE post is in: https://github.com/pablo14/post_cluster_tsne
######################################################################

## Download data from: https://github.com/pablo14/post_cluster_tsne/blob/master/data_1.txt
data_tsne=read.delim(&quot;data_1.txt&quot;, header = TRUE, stringsAsFactors = FALSE, sep = &quot;\t&quot;)

## Rtsne function may take some minutes to complete...
set.seed(9)
tsne_model_1 = Rtsne(as.matrix(data_tsne), check_duplicates=FALSE, pca=TRUE, perplexity=30, theta=0.5, dims=2)

## getting the two dimension matrix
d_tsne_1 = as.data.frame(tsne_model_1$Y)
</code></pre>
<p>Different runs of <code>Rtsne</code> lead to different results. So you will most likely not see exactly the same model as the one presented here.</p>
<p>According to the official documentation, <code>perplexity</code> is related to the importance of neighbors:</p>
<ul>
<li><em>&quot;It is comparable with the number of nearest neighbors k that is employed in many manifold learners.&quot;</em></li>
<li><em>&quot;Typical values for the perplexity range between 5 and 50.&quot;</em></li>
</ul>
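<p>Because perplexity behaves like a number of neighbors, it cannot be large relative to the number of rows. A small sketch, assuming the rule that 3 * perplexity must not exceed the number of rows minus 1 (this is my reading of the check done by <code>Rtsne</code>, please verify it against the package documentation). <code>max_perplexity</code> is a hypothetical helper, not part of any package:</p>
<pre><code class="language-r">## Largest perplexity accepted for a data set with n rows, ASSUMING the
## rule: 3 * perplexity must not exceed n - 1 (verify it in the Rtsne docs).
max_perplexity = function(n) floor((n - 1) / 3)

max_perplexity(100)    ## 33
max_perplexity(10000)  ## 3333

## keep the usual default of 30 unless the data set is too small for it
n_rows = 60
perplexity_to_use = min(30, max_perplexity(n_rows))
perplexity_to_use      ## 19
</code></pre>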
<p>The object <code>tsne_model_1$Y</code> contains the X-Y coordinates (variables <code>V1</code> and <code>V2</code>) for each input case.</p>
<br> 
<p>Plotting the t-SNE results:</p>
<pre><code class="language-r">## plotting the results without clustering
ggplot(d_tsne_1, aes(x=V1, y=V2)) +
  geom_point(size=0.25) +
  guides(colour=guide_legend(override.aes=list(size=6))) +
  xlab(&quot;&quot;) + ylab(&quot;&quot;) +
  ggtitle(&quot;t-SNE&quot;) +
  theme_light(base_size=20) +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank()) +
  scale_colour_brewer(palette = &quot;Set2&quot;)
</code></pre>
<img src="https://datascienceheroes.com/img/blog/tsne_output.png" width="700px" alt="Jugando con las dimensiones: desde Clustering, PCA, t-SNE.... ¡hasta Carl Sagan!">
<br>
<p>And there are the famous &quot;islands&quot; 🏝️. At this point we could do some clustering just by looking at it... But let's try k-means and hierarchical clustering instead 😄. The t-SNE FAQ page suggests decreasing the perplexity parameter to avoid this; however, I didn't find any problem with this result.</p>
<br>
<h4 id="creandolosmodelosdeclsteres">Creating the cluster models</h4>
<p>The next piece of code creates the <strong>k-means</strong> and <strong>hierarchical</strong> cluster models, and then assigns to each input case the number of the cluster (1, 2 or 3) it belongs to.</p>
<pre><code class="language-r">## keeping original data
d_tsne_1_original=d_tsne_1

## Creating k-means clustering model, and assigning the result to the data used to create the tsne
fit_cluster_kmeans=kmeans(scale(d_tsne_1), 3)
d_tsne_1_original$cl_kmeans = factor(fit_cluster_kmeans$cluster)

## Creating hierarchical cluster model, and assigning the result to the data used to create the tsne
fit_cluster_hierarchical=hclust(dist(scale(d_tsne_1)))

## setting 3 clusters as output
d_tsne_1_original$cl_hierarchical = factor(cutree(fit_cluster_hierarchical, k=3))
</code></pre>
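<p>Since k-means starts from random centers, two runs can disagree. A minimal sketch of two hedges against that: the <code>nstart</code> argument of base R <code>kmeans</code>, and a cross-table to see how much the two models agree. The <code>toy_map</code> data frame is made up for illustration and only stands in for the t-SNE output:</p>
<pre><code class="language-r">## Toy 2-D map standing in for the t-SNE output (hypothetical data)
set.seed(9)
toy_map = data.frame(
  V1 = c(rnorm(50, 0), rnorm(50, 6), rnorm(50, 12)),
  V2 = c(rnorm(50, 0), rnorm(50, 6), rnorm(50, 0))
)

## nstart = 25 tries 25 random initializations and keeps the best one,
## so the result depends much less on a single random start
fit_km = kmeans(scale(toy_map), centers = 3, nstart = 25)
fit_hc = hclust(dist(scale(toy_map)))

cl_kmeans = factor(fit_km$cluster)
cl_hierarchical = factor(cutree(fit_hc, k = 3))

## rows: k-means clusters, columns: hierarchical clusters
## (cluster numbers are arbitrary, so look for one dominant cell per row)
table(cl_kmeans, cl_hierarchical)
</code></pre>
<p>When most of the counts fall in one cell per row, both models are finding roughly the same groups, even if they number them differently.</p>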
<br>
<h4 id="graficandolosmodelosdeclsteresenlasalidadetsne">Plotting the cluster models onto the t-SNE output</h4>
<p>Now it's time to plot the result of each cluster model, based on the t-SNE map.</p>
<pre><code class="language-r">plot_cluster=function(data, var_cluster, palette)
{
  ggplot(data, aes_string(x=&quot;V1&quot;, y=&quot;V2&quot;, color=var_cluster)) +
  geom_point(size=0.25) +
  guides(colour=guide_legend(override.aes=list(size=6))) +
  xlab(&quot;&quot;) + ylab(&quot;&quot;) +
  ggtitle(&quot;&quot;) +
  theme_light(base_size=20) +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        legend.direction = &quot;horizontal&quot;, 
        legend.position = &quot;bottom&quot;,
        legend.box = &quot;horizontal&quot;) + 
    scale_colour_brewer(palette = palette) 
}


plot_k=plot_cluster(d_tsne_1_original, &quot;cl_kmeans&quot;, &quot;Accent&quot;)
plot_h=plot_cluster(d_tsne_1_original, &quot;cl_hierarchical&quot;, &quot;Set1&quot;)

## and finally: putting the plots side by side with gridExtra lib...
library(gridExtra)
grid.arrange(plot_k, plot_h,  ncol=2)
</code></pre>
<img src="https://datascienceheroes.com/img/blog/tsne_cluster_kmeans_hierarchical.png" width="700px" alt="Jugando con las dimensiones: desde Clustering, PCA, t-SNE.... ¡hasta Carl Sagan!">
<br>
<h3 id="anlisisvisual">Visual analysis</h3>
<p>In this case, and based only on visual analysis, hierarchical seems to make more <em>common sense</em> than k-means. Take a look at the following image:</p>
<img src="https://datascienceheroes.com/img/blog/cluster_analysis.png" width="700px" alt="Jugando con las dimensiones: desde Clustering, PCA, t-SNE.... ¡hasta Carl Sagan!">
<p>Note: the dashed lines separating the clusters were drawn by hand.</p>
<br>
<p>In k-means, the points in the bottom-left corner are quite close to each other compared with the distance to other points within the same cluster. Yet they belong to different clusters. Illustrating it:</p>
<img src="https://datascienceheroes.com/img/blog/kmeans_cluster.png" width="450px" alt="Jugando con las dimensiones: desde Clustering, PCA, t-SNE.... ¡hasta Carl Sagan!">
<p>So we have: the red arrow is shorter than the blue one...</p>
<p>Note: Different runs may lead to different groupings; if you don't see this effect in that part of the map, look for it elsewhere.</p>
<p>This effect doesn't happen in hierarchical clustering. The clusters from this model look more even. But what do you think?</p>
<br>
<h4 id="sesgandoelanlisishaciendotrampa">Biasing the analysis (cheating)</h4>
<p>It's not fair to k-means to be compared like that. The last analysis is based on the idea of <strong>density clustering</strong>. This technique is really cool for overcoming the pitfalls of simpler methods.</p>
<p><strong>The HDBSCAN algorithm</strong> bases its process on densities.</p>
<p>Find the essence of each one by looking at this picture:</p>
<img src="https://datascienceheroes.com/img/blog/hdbscan_vs_kmeans.png" alt="Jugando con las dimensiones: desde Clustering, PCA, t-SNE.... ¡hasta Carl Sagan!" width="700px">
<p>Surely you understood the difference between them...</p>
<p>The last picture comes from: <a href="http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb" target="blank">Comparing Python Clustering Algorithms</a>. Yes, Python, but it's the same for R. The package is <a href="https://cran.r-project.org/web/packages/largeVis/vignettes/largeVis.html">largeVis</a>. <em>(Note: Install it by doing: <code>install_github(&quot;elbamos/largeVis&quot;, ref = &quot;release/0.2&quot;)</code>)</em>.</p>
<br>
<h3 id="deeplearningandtsne">Deep learning and t-SNE</h3>
<p>Quoting Luke Metz from a great post (<a href="https://indico.io/blog/visualizing-with-t-sne/" target="blank">Visualizing with t-SNE</a>):</p>
<blockquote>
<p>Recently there has been a lot of hype around the term &quot;deep learning&quot;. In most applications, these &quot;deep&quot; models can be boiled down to the composition of simple functions that embed from one high dimensional space to another. At first glance, these spaces might seem too large to think about or visualize, but techniques such as t-SNE allow us to start to understand what's going on inside the black box. Now, instead of treating these models as black boxes, we can start to visualize and understand them.</p>
</blockquote>
<p>A <em>deep</em> comment 👏.</p>
<h3 id="pensamientosfinales">Final thoughts 🚀</h3>
<p>Beyond this post, <strong>t-SNE</strong> has proven to be a general-purpose tool for reducing dimensionality. It can be used to explore the relationships within the data by building clusters, or to <a href="https://auth0.com/blog/machine-learning-for-everyone-part-2-abnormal-behavior" target="blank">analyze anomaly cases</a> by inspecting the isolated points on the map.</p>
<p>Playing with dimensions is a key concept in data science and machine learning. The perplexity parameter is really similar to the <em>k</em> in the nearest neighbor algorithm (<a href="https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm" target="blank">k-NN</a>). Mapping data into 2 dimensions and then doing clustering? Hmmm, that's not new, buddy: <a href="http://www.shanelynn.ie/self-organising-maps-for-customer-segmentation-using-r/" target="blank">Self-Organising Maps for Customer Segmentation</a>.</p>
<p>When we select the best variables to build a model, we are reducing the dimension of the data. When we build a model, we are creating a function that describes the relationships in the data... and so on...</p>
<p>Did you know the general concepts behind k-NN and PCA? Well, this is one more step: just plug the cables in the brain and that's it. Learning general concepts gives us the opportunity to make this kind of association between all of these techniques. Beyond comparing programming languages, the power -in my opinion- lies in focusing on how the data behaves, and on how these techniques are, and can be, connected.</p>
<br>
<p>Explore the imagination with this video by Carl Sagan: Flatland and the 4th Dimension. A tale about the interaction of 3D objects on a 2D plane...</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/N0WjV6MmCyM" frameborder="0"></iframe>
<br>
<hr>
<p>📌  Keep learning about machine learning!</p>
<p>📗 <strong><a href="https://librovivodecienciadedatos.ai/">Libro Vivo de Ciencia de Datos</a></strong> (open-source) Fully available online!</p>
<img src="https://blog.datascienceheroes.com/content/images/2019/06/libro-vivo-de-ciencia-de-datos.png" alt="Jugando con las dimensiones: desde Clustering, PCA, t-SNE.... ¡hasta Carl Sagan!" width="200px">
</div>]]></content:encoded></item><item><title><![CDATA[redshiftTools v1.0.0 - CRAN Release!]]></title><description><![CDATA[<div class="kg-card-markdown"><p>A new version of the package redshiftTools has arrived with improvements and it's now available in <a href="https://cran.r-project.org/web/packages/redshiftTools/index.html">CRAN</a>! This package lets you efficiently upload data into an Amazon Redshift database <a href="https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html">using the approach recommended by Amazon</a>.</p>
<p>This package is helpful because otherwise uploading data with inserts in Redshift is super slow,</p></div>]]></description><link>https://blog.datascienceheroes.com/redshifttools-v1-0-0-cran-release/</link><guid isPermaLink="false">5cddd28d42cada04ad162080</guid><category><![CDATA[R]]></category><category><![CDATA[rstats]]></category><category><![CDATA[amazon-redshift]]></category><dc:creator><![CDATA[Pablo Seibelt]]></dc:creator><pubDate>Fri, 17 May 2019 15:00:00 GMT</pubDate><media:content url="https://blog.datascienceheroes.com/content/images/2019/05/logo.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.datascienceheroes.com/content/images/2019/05/logo.png" alt="redshiftTools v1.0.0 - CRAN Release!"><p>A new version of the package redshiftTools has arrived with improvements and it's now available in <a href="https://cran.r-project.org/web/packages/redshiftTools/index.html">CRAN</a>! This package lets you efficiently upload data into an Amazon Redshift database <a href="https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html">using the approach recommended by Amazon</a>.</p>
<p>This package is helpful because uploading data with inserts in Redshift is otherwise super slow. It follows the way of doing replaces and upserts recommended by the Redshift documentation: generating several CSV files, uploading them to an S3 bucket, and then calling a copy command on the Redshift server. All of that is handled by the package.</p>
<p>To install this package, use the following command:</p>
<pre><code>install.packages('redshiftTools')
</code></pre>
<p>After installing, you'll have these functions to use, which are explained in full detail in the package's man pages.</p>
<p><em>rs_create_statement</em>: Generates the SQL statement to create a table based on the structure of a data.frame. It allows you to specify the sort key, the dist key, and whether compression should be added.</p>
<p><em>rs_replace_table</em>: Deletes all records in a table, then uploads the provided data frame into it. It runs as a transaction so the table is never empty to the other users.</p>
<p><em>rs_upsert_table</em>: Deletes all records matching the provided keys from the uploaded dataset, and then inserts the rows from the dataset. If no keys are provided, it acts as a regular insert.</p>
<p><em>rs_cols_upsert_table</em>: Like rs_upsert_table, but lets you choose only some columns to update.</p>
<p><em>rs_append_table</em>: Like the previous functions but only appends data without altering existing data.</p>
<p><em>rs_create_table</em>: This just runs rs_create_statement and then rs_replace_table, creating a table with the same structure as your data frame and then uploading the data frame to it.</p>
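<p>To make the upsert behaviour described for <em>rs_upsert_table</em> concrete, here is a minimal sketch on plain data frames, with no database involved. <code>upsert_df</code> is a hypothetical helper written only for this illustration; it mimics the described semantics and is not the package's implementation:</p>
<pre><code class="language-r">## Illustration of upsert semantics on plain data frames (no database).
## upsert_df is a hypothetical helper, NOT part of redshiftTools.
upsert_df = function(target, new_rows, keys) {
  ## without keys it acts as a regular insert
  if (length(keys) == 0) return(rbind(target, new_rows))
  ## build one comparable key per row from the key columns
  key_of = function(df) do.call(paste, c(df[keys], sep = '|'))
  ## 1) delete the rows of target whose keys appear in the new data
  kept = target[!(key_of(target) %in% key_of(new_rows)), , drop = FALSE]
  ## 2) insert the new rows
  rbind(kept, new_rows)
}

target = data.frame(id = 1:3, value = c('a', 'b', 'c'),
                    stringsAsFactors = FALSE)
new_rows = data.frame(id = c(2L, 4L), value = c('B', 'd'),
                      stringsAsFactors = FALSE)

result = upsert_df(target, new_rows, keys = 'id')
result[order(result$id), ]  ## id 2 now holds 'B', and id 4 is new
</code></pre>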
<p>For more details, read the official README in <a href="https://github.com/sicarul/redshiftTools">https://github.com/sicarul/redshiftTools</a></p>
<p>A special thanks to all the collaborators that sent contributions to the package:</p>
<ul>
<li><a href="https://github.com/kwent">Quentin Rousseau - kwent</a></li>
<li><a href="https://github.com/rtjohn">Ryan Johnson - rtjohn</a></li>
<li><a href="https://github.com/ilyaminati">Ilya Goldin - ilyaminati</a></li>
<li><a href="https://github.com/mfarkhann">Farkhan Novianto - mfarkhann</a></li>
<li><a href="https://github.com/Emelieh21">Emelie Hofland - Emelieh21</a></li>
</ul>
<h2 id="futureplans">Future Plans</h2>
<p>For future versions, I plan to include additional utility functions that allow you to obtain table metadata, optimize table encoding, check table permissions, etc. If you have some cool functionality to share, please send a pull request!</p>
</div>]]></content:encoded></item><item><title><![CDATA[Lanzamiento! Libro Vivo de Ciencia de Datos 📗 (open-source)]]></title><description><![CDATA[The Spanish version of the _Data Science Live Book_ is finally available! The book opens up, free of language barriers, to Spanish-speaking people eager to learn 👨‍🎓👩‍🎓.

This publication is an edition of the English version, revised in both grammar and technical aspects.]]></description><link>https://blog.datascienceheroes.com/lanzamiento-libro-vivo-de-ciencia-de-datos-open-source/</link><guid isPermaLink="false">5ca6207542cada04ad16205d</guid><category><![CDATA[libro-vivo-ciencia-datos]]></category><category><![CDATA[ebook]]></category><category><![CDATA[r-esp]]></category><dc:creator><![CDATA[Pablo Casas]]></dc:creator><pubDate>Thu, 04 Apr 2019 16:51:23 GMT</pubDate><media:content url="https://blog.datascienceheroes.com/content/images/2019/04/Screen-Shot-2019-04-04-at-13.46.11.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.datascienceheroes.com/content/images/2019/04/Screen-Shot-2019-04-04-at-13.46.11.png" alt="Lanzamiento! Libro Vivo de Ciencia de Datos 📗 (open-source)"><p>The Spanish version of the <em>Data Science Live Book</em> is finally available! The book opens up, free of language barriers, to Spanish-speaking people eager to learn 👨‍🎓👩‍🎓.</p>
<p>This publication is an edition of the English version, revised in both grammar and technical aspects. You can access the complete online version at:</p>
<h3 id="librovivodecienciadedatosai">👉 <strong><a href="https://LibroVivoDeCienciaDeDatos.ai">LibroVivoDeCienciaDeDatos.ai</a></strong> 🚀</h3>
<img src="https://librovivodecienciadedatos.ai/introduction/libro-vivo-de-ciencia-de-datos.png" alt="Lanzamiento! Libro Vivo de Ciencia de Datos 📗 (open-source)" width="400px">
<br>
<p>The <em>Data Science Live Book</em>, together with two articles on how to self-publish a book using bookdown, received an award from RStudio in the <a href="https://community.rstudio.com/t/announcing-winners-of-the-1st-bookdown-contest/16394">1st Bookdown Contest</a>.</p>
<h4 id="porqupublicarenespaolsiyaesteningls">Why publish in Spanish if it's already in English?</h4>
<p>I remember that when I started studying Data Science (or Data Mining, as it was called back then), it was much harder for me to understand the technical concepts while translating at the same time, or looking up words in a dictionary.</p>
<p>Although being in this data world requires reading English, this version aims to bring data science closer to people who are not yet comfortable reading in another language.</p>
<p>Luckily, there are more and more resources in Spanish to learn from. I hope the book encourages more writing in other languages.</p>
<h4 id="libroopensource">Open-source book</h4>
<p>The Spanish version is still open-source; here is its repository on <a href="https://github.com/pablo14/libro-vivo-ciencia-datos">Github</a> in case you want to make suggestions or spot bugs that I silently wrote.</p>
<img src="https://blog.datascienceheroes.com/content/images/2019/04/coyote.gif" width="300px" alt="Lanzamiento! Libro Vivo de Ciencia de Datos 📗 (open-source)">
<p><em>Licencia: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)</em></p>
<h4 id="vaasalirunaversinenpapelimpreso">Will there be a printed paper version?</h4>
<p>Yes, over the coming weeks, like the English version.</p>
<h4 id="cmodescargoellibro">How do I download the book?</h4>
<p>If you like it and want to support the project (as well as help cover some publishing costs), you can download it under the <em>&quot;name a fair price&quot;</em> philosophy, with a floor of US$ 5:</p>
<p>Book <strong><a href="https://librovivodecienciadedatos.ai/descargar-libro.html">download page</a></strong> 📥.</p>
<br>
<img src="https://blog.datascienceheroes.com/content/images/2019/04/pink-panther.gif" width="300px" alt="Lanzamiento! Libro Vivo de Ciencia de Datos 📗 (open-source)">
<br>
<p>Thanks to those who collaborated with a suggestion: Alain Rodriguez, Andrew White, Chip Oglesby, Federico Molina, Federico Otero, Jonas Ertel, Lucas Crespo, Pablo Seibelt, Stuart Hertzog, Holger K. von Jouanne-Diedrich, Bernardo Lares, Kevin Hammond, Sebastian Varela, Damian Covalski.</p>
<hr>
<p>Thanks for reading, and I hope you find the book useful! 🚀</p>
<p>Keep in touch on <a href="https://twitter.com/pabloc_ds">Twitter</a> and <a href="https://www.linkedin.com/in/pcasas/">Linkedin</a>.</p>
</div>]]></content:encoded></item><item><title><![CDATA[A gentle introduction to SHAP values in R]]></title><description><![CDATA[Opening the black-box in complex models: SHAP values.
What are they and how to draw conclusions from them?
With R code example!]]></description><link>https://blog.datascienceheroes.com/how-to-interpret-shap-values-in-r/</link><guid isPermaLink="false">5c8fb45c3f68cc3d8005d4d7</guid><category><![CDATA[shap]]></category><category><![CDATA[feature-selection]]></category><category><![CDATA[machine learning]]></category><category><![CDATA[importance variable]]></category><category><![CDATA[model performance]]></category><category><![CDATA[R]]></category><dc:creator><![CDATA[Pablo Casas]]></dc:creator><pubDate>Mon, 18 Mar 2019 15:23:34 GMT</pubDate><media:content url="https://blog.datascienceheroes.com/content/images/2019/03/shap_summary_heart_disease-1.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.datascienceheroes.com/content/images/2019/03/shap_summary_heart_disease-1.png" alt="A gentle introduction to SHAP values in R"><p>Hi there! During the first meetup of <a href="https://argentinar.org/">argentinaR.org</a> -an R user group- <a href="https://www.linkedin.com/in/danielquelali/">Daniel Quelali</a> introduced us to a new model validation technique called <strong>SHAP values</strong>.</p>
<p>This novel approach allows us to dig a little deeper into the complexity of the predictive model results, while also exploring the relationships between variables for each predicted case.</p>
<img src="https://blog.datascienceheroes.com/content/images/2019/03/simpsons.gif" width="300px" alt="A gentle introduction to SHAP values in R">
<p>I've been using it with &quot;real&quot; data, cross-validating the results, and let me tell you it works.<br>
This post is a gentle introduction to it, hope you enjoy it!</p>
<p><em>Find me on <a href="https://twitter.com/pabloc_ds">Twitter</a> and <a href="https://www.linkedin.com/in/pcasas/">Linkedin</a>.</em></p>
<p><strong>Clone <a href="https://github.com/pablo14/shap-values">this github repository</a></strong> to reproduce the plots.</p>
<h2 id="introduction">Introduction</h2>
<p>Complex predictive models are not easy to interpret. By complex I mean: random forest, xgboost, deep learning, etc.</p>
<p>In other words, given a certain prediction, like having a <em>likelihood of buying= 90%</em>, what was the influence of each input variable in order to get that score?</p>
<p>A recent technique to interpret black-box models has stood out among others: <a href="https://github.com/slundberg/shap">SHAP</a> (<strong>SH</strong>apley <strong>A</strong>dditive ex<strong>P</strong>lanations) developed by Scott M. Lundberg.</p>
<p>Imagine a sales score model. A customer living in zip code &quot;A1&quot; with &quot;10 purchases&quot; arrives and their score is 95%, while another from zip code &quot;A2&quot; with &quot;7 purchases&quot; has a score of 60%.</p>
<p>Each variable had its contribution to the final score. Maybe a slight change in the number of purchases changes the score <em>a lot</em>, while changing the zip code only contributes a tiny amount on that specific customer.</p>
<p>SHAP measures the impact of variables taking into account the interaction with other variables.</p>
<blockquote>
<p>Shapley values calculate the importance of a feature by comparing what a model predicts with and without the feature. However, since the order in which a model sees features can affect its predictions, this is done in every possible order, so that the features are fairly compared.</p>
</blockquote>
<p><a href="https://medium.com/@gabrieltseng/interpreting-complex-models-with-shap-values-1c187db6ec83">Source</a></p>
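<p>The &quot;every possible order&quot; idea can be sketched in a few lines of base R. This is a toy illustration and not how the SHAP library computes things: the <code>value</code> function and its numbers are made up, and going through every ordering is only feasible for a handful of features.</p>
<pre><code class="language-r">## Toy 'model': the score obtained when only some features are known.
## The numbers are made up; note the extra 10 when purchases and
## zip_code are present together (an interaction).
features = c('purchases', 'zip_code', 'age')
value = function(present) {
  v = 50                                        ## baseline score
  if ('purchases' %in% present) v = v + 20
  if ('zip_code' %in% present) v = v + 5
  if (all(c('purchases', 'zip_code') %in% present)) v = v + 10
  v
}

## all orderings of a vector (fine for a handful of features)
permutations = function(x) {
  if (length(x) == 1) return(list(x))
  out = list()
  for (i in seq_along(x)) {
    for (p in permutations(x[-i])) out[[length(out) + 1]] = c(x[i], p)
  }
  out
}

## Shapley value = average marginal contribution over every possible order
shapley = sapply(features, function(f) {
  contribs = sapply(permutations(features), function(ord) {
    before = ord[seq_len(match(f, ord) - 1)]
    value(c(before, f)) - value(before)
  })
  mean(contribs)
})

shapley                             ## purchases 25, zip_code 10, age 0
sum(shapley) + value(character(0))  ## 85, the same as value(features)
</code></pre>
<p>The interaction bonus of 10 is split evenly between the two variables that produce it, and the contributions plus the baseline add up to the full prediction.</p>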
<h2 id="shapvaluesindata">SHAP values in data</h2>
<p>If the original data has 200 rows and 10 variables, the shap value table will <strong>have the same dimension</strong> (200 x 10). Note that xgboost's <code>predcontrib</code> output also appends one extra column, <code>BIAS</code>, holding the baseline value.</p>
<p>The original values from the input data are replaced by their SHAP values. However, it is not the same replacement for all the rows. Maybe a value of <code>10 purchases</code> is replaced by the value <code>0.3</code> in customer 1, but in customer 2 it is replaced by <code>0.6</code>. This change is due to how the variable for that customer interacts with other variables. Variables work in groups and describe a whole.</p>
<p>Shap values can be obtained by doing:</p>
<p><code>shap_values=predict(xgboost_model, input_data, predcontrib = TRUE, approxcontrib = F)</code></p>
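<p>To see the shape of such a table without training xgboost, here is a sketch with a plain linear model, where (assuming independent features) the SHAP value of a cell is simply the coefficient times the distance of the value from the variable's mean. The <code>mtcars</code> data is only a stand-in, and this shortcut is valid for linear models only:</p>
<pre><code class="language-r">## For a linear model (assuming independent features) the SHAP value of a
## variable in a row is: coefficient * (value - mean of the variable).
## mtcars is only a stand-in data set.
input_data = mtcars[, c('wt', 'hp', 'qsec')]
fit = lm(mpg ~ wt + hp + qsec, data = mtcars)

centered = sweep(as.matrix(input_data), 2, colMeans(input_data))
shap_values = sweep(centered, 2, coef(fit)[colnames(input_data)], '*')

dim(input_data)   ## 32 3
dim(shap_values)  ## 32 3, one SHAP value per original cell

## additivity: baseline (mean prediction) + row sum of SHAP values = prediction
baseline = mean(predict(fit))
all.equal(unname(baseline + rowSums(shap_values)), unname(predict(fit)))

## a common way to rank variables: mean absolute SHAP value per column
sort(colMeans(abs(shap_values)), decreasing = TRUE)
</code></pre>
<p>The same row can have a large SHAP value for one variable and a tiny one for another, and each row's values sum to that row's prediction minus the baseline.</p>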
<h2 id="exampleinr">Example in R</h2>
<p>After creating an xgboost model, we can plot the shap summary for a rental bike dataset. The target variable is the count of rentals for that particular day.</p>
<p>Function <code>plot.shap.summary</code> (from the <a href="https://github.com/pablo14/shap-values">github repo</a>) gives us:</p>
<img src="https://blog.datascienceheroes.com/content/images/2019/03/shap_summary_bike.png" alt="A gentle introduction to SHAP values in R" width="600px">
<h3 id="howtointerprettheshapsummaryplot">How to interpret the shap summary plot?</h3>
<ul>
<li>The y-axis indicates the variable name, in order of importance from top to bottom. The value next to them is the mean SHAP value.</li>
<li>The x-axis shows the SHAP value. In a binary classification model it indicates the change in log-odds, from which we can extract the probability of success; in a regression model such as the bike example it is expressed in the units of the target.</li>
<li>Gradient color indicates the original value for that variable. For booleans it takes two colors, while for numeric variables it can span the whole spectrum.</li>
<li>Each point represents a row from the original dataset.</li>
</ul>
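<p>When the model is a binary classifier working on the log-odds scale, the step from SHAP values to a probability can be sketched with base R's <code>plogis</code>. All the numbers below are hypothetical and only illustrate the arithmetic:</p>
<pre><code class="language-r">## Hypothetical numbers for one row of a binary classification model
baseline_log_odds = -1.2   ## model baseline, on the log-odds scale
shap_row = c(purchases = 1.5, zip_code = 0.2, age = -0.1)

## SHAP values are additive on the log-odds scale
log_odds = baseline_log_odds + sum(shap_row)

## logistic function: 1 / (1 + exp(-log_odds))
plogis(log_odds)

## change in probability attributable to a single variable,
## holding the other contributions fixed
plogis(log_odds) - plogis(log_odds - shap_row[['purchases']])
</code></pre>
<p>Because the logistic function is not linear, the same SHAP value moves the probability more when the score is near 50% than when it is close to 0% or 100%.</p>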
<p>Going back to the bike dataset, most of the variables are boolean.</p>
<p>We can see that high humidity is associated with <strong>high and negative</strong> values on the target, where <em>high</em> comes from the color and <em>negative</em> from the x value.</p>
<p>In other words, people rent fewer bikes if humidity is high.</p>
<p>When <code>season.WINTER</code> is high (or true), the shap value is high. People rent more bikes in winter, which is a nice finding since it sounds counter-intuitive. Note that the point dispersion in <code>season.WINTER</code> is smaller than in <code>hum</code>.</p>
<p>Doing a simple violin plot for variable <code>season</code> confirms the pattern:</p>
<img src="https://blog.datascienceheroes.com/content/images/2019/03/bike_season.png" alt="A gentle introduction to SHAP values in R" width="500px">
<p>As expected, rainy, snowy or stormy days are associated with less renting. However, if the value is <code>0</code>, it doesn't affect bike renting much. Look at the yellow points around the 0 value. We can check the original variable and see the difference:</p>
<img src="https://blog.datascienceheroes.com/content/images/2019/03/bike_warhersit.png" alt="A gentle introduction to SHAP values in R" width="500px"> 
<p>What conclusion can you draw by looking at variables <code>weekday.SAT</code> and <code>weekday.MON</code>?</p>
<h3 id="shapsummaryfromxgboostpackage">Shap summary from xgboost package</h3>
<p>Function <code>xgb.plot.shap</code> from xgboost package provides these plots:</p>
<img src="https://blog.datascienceheroes.com/content/images/2019/03/shap_value_all.png" alt="A gentle introduction to SHAP values in R" width="600px">
<ul>
<li>y-axis: shap value.</li>
<li>x-axis: original variable value.</li>
</ul>
<p>Each blue dot is a row (a <em>day</em> in this case).</p>
<p>Looking at the <code>temp</code> variable, we can see how lower temperatures are associated with a big decrease in shap values. It is interesting to note that around the value 22-23 the curve starts to decrease again. A perfect non-linear relationship.</p>
<p>Taking <code>mnth.SEP</code> we can observe that dispersion around 0 is almost 0, while on the other hand, the value 1 is associated mainly with a shap increase around 200, but it also has certain days where it can push the shap value to more than 400.</p>
<p><code>mnth.SEP</code> is a good case of <strong>interaction</strong> with other variables, since in presence of the same value (<code>1</code>), the shap value can differ a lot. What are the effects with other variables that explain this variance in the output? A topic for another post.</p>
<h2 id="rpackageswithshap">R packages with SHAP</h2>
<p><strong><a href="https://cran.r-project.org/web/packages/iml/vignettes/intro.html">Interpretable Machine Learning</a></strong> by Christoph Molnar.</p>
<img src="https://blog.datascienceheroes.com/content/images/2019/03/iml_shap_R_package.png" alt="A gentle introduction to SHAP values in R" width="500px">
<p><strong><a href="http://smarterpoland.pl/index.php/2019/03/shapper-is-on-cran-its-an-r-wrapper-over-shap-explainer-for-black-box-models/">shapper</a></strong></p>
<p>An R wrapper over the Python SHAP library:</p>
<img src="https://blog.datascienceheroes.com/content/images/2019/03/shapper.png" alt="A gentle introduction to SHAP values in R" width="500px">
<p><strong><a href="https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211">xgboostExplainer</a></strong></p>
<p>Although it's not SHAP, the idea is really similar. It calculates the contribution of each value in every case by accessing the structure of the trees used in the model.</p>
<img src="https://blog.datascienceheroes.com/content/images/2019/03/xgboostExplainer.png" alt="A gentle introduction to SHAP values in R" width="500px">
<h2 id="recommendedliteratureaboutshapvalues">Recommended literature about SHAP values 📚</h2>
<p>There is a vast literature around this technique. Check the online book <em>Interpretable Machine Learning</em> by Christoph Molnar, which nicely addresses <a href="https://christophm.github.io/interpretable-ml-book/agnostic.html">Model-Agnostic Methods</a> and one of its particular cases, <a href="https://christophm.github.io/interpretable-ml-book/shapley.html">Shapley values</a>. An outstanding work.</p>
<p>From classical variable-ranking approaches like <em>weight</em> and <em>gain</em> to shap values: <a href="https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27">Interpretable Machine Learning with XGBoost</a> by Scott Lundberg.</p>
<p>A permutation perspective with examples: <a href="https://towardsdatascience.com/one-feature-attribution-method-to-supposedly-rule-them-all-shapley-values-f3e04534983d">One Feature Attribution Method to (Supposedly) Rule Them All: Shapley Values</a>.</p>
<p>--</p>
<p>If you have any questions, leave them below :)</p>
<p>Thanks for reading! 🚀</p>
<p>Other readings you might like:</p>
<ul>
<li><a href="https://blog.datascienceheroes.com/discretization-recursive-gain-ratio-maximization/">New discretization method: Recursive information gain ratio maximization</a></li>
<li><a href="https://blog.datascienceheroes.com/feature-selection-using-genetic-algorithms-in-r/">Feature Selection using Genetic Algorithms in R</a></li>
<li>📗<a href="http://livebook.datascienceheroes.com/">Data Science Live Book</a></li>
</ul>
<p><a href="https://twitter.com/pabloc_ds">Twitter</a> and <a href="https://www.linkedin.com/in/pcasas/">Linkedin</a>.</p>
</div>]]></content:encoded></item><item><title><![CDATA[New discretization method: Recursive information gain ratio maximization]]></title><description><![CDATA[This method can discretize a variable taking into consideration the target variable, similar to what decision tree do but with gain ratio. ]]></description><link>https://blog.datascienceheroes.com/discretization-recursive-gain-ratio-maximization/</link><guid isPermaLink="false">5c64767f3f68cc3d8005d4c9</guid><category><![CDATA[data preparation]]></category><category><![CDATA[machine learning]]></category><category><![CDATA[R]]></category><category><![CDATA[funModeling]]></category><dc:creator><![CDATA[Pablo Casas]]></dc:creator><pubDate>Wed, 13 Feb 2019 20:10:24 GMT</pubDate><media:content url="https://blog.datascienceheroes.com/content/images/2019/02/discretize_rgr-1.gif" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.datascienceheroes.com/content/images/2019/02/discretize_rgr-1.gif" alt="New discretization method: Recursive information gain ratio maximization"><p>Hello everyone, I'm happy to share a new method to discretize variables I was working on for the last few months:</p>
<p><strong>Recursive discretization using gain ratio for multi-class variable</strong></p>
<p>tl;dr: <code>funModeling::discretize_rgr(input, target)</code></p>
<p>The problem: we need to convert a numeric variable into a categorical one, taking into account its relationship with the target variable.</p>
<p>How do we choose the split points for each segment? The selection can improve or worsen the relationship.</p>
<h2 id="example">Example</h2>
<pre><code class="language-r"># Available from version 1.7 (2019-02-13), please update it before proceeding:
# install.packages(&quot;funModeling&quot;) 
library(funModeling)
library(dplyr)

heart_disease$oldpeak_2 = discretize_rgr(input=heart_disease$oldpeak, target=heart_disease$has_heart_disease)
</code></pre>
<p>Check the results:</p>
<p>Before and after the transformation</p>
<pre><code class="language-r">head(select(heart_disease, oldpeak, oldpeak_2))
</code></pre>
<pre><code class="language-r">##   oldpeak oldpeak_2
## 1     2.3 [1.9,6.2]
## 2     1.5 [1.4,1.9)
## 3     2.6 [1.9,6.2]
## 4     3.5 [1.9,6.2]
## 5     1.4 [1.4,1.9)
## 6     0.8 [0.6,1.0)
</code></pre>
<p>Checking the distribution</p>
<pre><code class="language-r">summary(heart_disease$oldpeak_2)
</code></pre>
<pre><code class="language-r">## [0.0,0.6) [0.6,1.0) [1.0,1.4) [1.4,1.9) [1.9,6.2] 
##       135        31        34        39        64
</code></pre>
<p>Plotting</p>
<pre><code class="language-r">cross_plot(heart_disease, input = &quot;oldpeak_2&quot;, target = &quot;has_heart_disease&quot;)
</code></pre>
<img src="https://blog.datascienceheroes.com/content/images/2019/02/cross_plot_discretize.png" width="500px" alt="New discretization method: Recursive information gain ratio maximization">
<p>Left: accuracy, right: representativeness (sample size).</p>
<p>More info about <code>cross_plot</code> <a href="https://livebook.datascienceheroes.com/selecting-best-variables.html#profiling_target_cross_plot">here</a>.</p>
<h2 id="parameters">Parameters</h2>
<ul>
<li><code>min_perc_bins</code>: Controls the minimum sample size per bin, <code>0.1</code> or 10% as default.</li>
<li><code>max_n_bins</code>: Maximum number of bins to split the input variable, <code>5</code> bins as default.</li>
</ul>
<p>Both parameters are related, in the sense that setting a higher number in <code>min_perc_bins</code> may not satisfy the number of desired bins (<code>max_n_bins</code>).</p>
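<p>As a rough illustration of how the two parameters interact, the call below asks for fewer, larger bins. The parameter names come from the list above; the values and the <code>oldpeak_3</code> column name are arbitrary choices for this sketch:</p>
<pre><code class="language-r">library(funModeling)

# Each bin must hold at least 20% of the rows, and we ask for at most 3 bins
heart_disease$oldpeak_3 = discretize_rgr(input = heart_disease$oldpeak,
                                         target = heart_disease$has_heart_disease,
                                         min_perc_bins = 0.2,
                                         max_n_bins = 3)

# Share of rows per bin
round(prop.table(table(heart_disease$oldpeak_3)), 2)
</code></pre>
<p>With <code>min_perc_bins = 0.2</code> there is room for at most 5 bins (5 x 20% = 100%), so asking for a larger <code>max_n_bins</code> would likely have no effect.</p>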
<h2 id="littlebenchmark">Little benchmark</h2>
<p>The next image shows ROC metrics for two models, one with the original variable and another with the discretized variable. In this case, the discretization improves the ROC value but decreases the specificity.</p>
<img src="https://blog.datascienceheroes.com/content/images/2019/02/bench3.png" width="400px" alt="New discretization method: Recursive information gain ratio maximization">
<h2 id="otherscenarios">Other scenarios</h2>
<h3 id="case1missingvaluesinnumericvariables">Case 1: Missing values in numeric variables.</h3>
<p>In this case, the way we discretize a variable matters even more. One data preparation trick is to convert it to categorical, where one category is <code>&quot;NA&quot;</code> and the remaining categories are the bins calculated by the algorithm. <code>funModeling</code> <a href="https://livebook.datascienceheroes.com/appendix.html#data-preparation">supports this scenario</a> for equal frequency discretization, and will do the same for <code>discretize_rgr</code>.</p>
<h3 id="case2exploratorydataanalysis">Case 2: Exploratory data analysis</h3>
<p>From the discretization, we can semantically describe the relationship between the input and the target variable. Finding the segments that maximize the likelihood might be quite helpful to report in our job or research.</p>
<h2 id="aboutthemethod">About the method</h2>
<ul>
<li>It keeps a minimum sample size per segment (representativity), thanks to <code>min_perc_bins</code></li>
<li>It uses the <strong>gain ratio</strong> metric to calculate the best split point that maximizes the target variable likelihood (accuracy).</li>
</ul>
<p>The control of minimum sample size helps to avoid bias in segments with low representativity.</p>
<p>Gain ratio is an improvement over information gain, commonly used in decision trees, since it penalizes variables with high cardinality (like zip code).</p>
<p>The method finds the best cut point from a list of possible candidates. Each candidate is calculated based on the percentiles. Once it finds a point that maximizes the gain ratio while at the same time satisfying the minimum sample size condition, it creates two search branches with the rows below and above the cut point, the <em>left</em> and the <em>right</em> respectively.</p>
<p>Then, for each branch, the algorithm finds the best cut point for that subset of rows, and the process repeats recursively until the stopping criteria are satisfied.</p>
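<p>To make the idea more concrete, here is a minimal base R sketch of the search described above. It is not the code inside <code>funModeling</code>: the function names (<code>entropy</code>, <code>gain_ratio</code>, <code>find_cuts</code>), the 5%-step percentile grid, the synthetic data and the <code>max_depth</code> stopping rule (used here instead of a maximum number of bins) are all assumptions made for illustration, and missing values are not handled.</p>
<pre><code class="language-r"># Shannon entropy (in bits) of a categorical vector
entropy = function(y) {
  p = prop.table(table(y))
  p = p[p &gt; 0]
  -sum(p * log2(p))
}

# Gain ratio of the binary split: input &gt;= cut_point
gain_ratio = function(input, target, cut_point) {
  side = input &gt;= cut_point
  if (length(unique(side)) == 1) return(0)        # everything falls on one side
  weights = prop.table(table(side))               # share of rows per side
  info_gain = entropy(target) - sum(weights * tapply(target, side, entropy))
  info_gain / entropy(side)                       # entropy(side) is the split information
}

# Recursive search; min_n = minimum rows per bin, max_depth = simplified stopping rule
find_cuts = function(input, target, min_n, max_depth) {
  if (max_depth == 0) return(numeric(0))
  # Candidates: percentiles of the rows in the current branch
  candidates = unique(quantile(input, probs = seq(0.05, 0.95, by = 0.05), names = FALSE))
  # Keep the candidates that leave at least min_n rows on each side
  n_left = sapply(candidates, function(cut_point) sum(input &lt; cut_point))
  valid = candidates[pmin(n_left, length(input) - n_left) &gt;= min_n]
  if (length(valid) == 0) return(numeric(0))
  # Best candidate = highest gain ratio
  scores = sapply(valid, function(cut_point) gain_ratio(input, target, cut_point))
  best = valid[which.max(scores)]
  # Repeat the search on the left and right branches
  left = input &lt; best
  c(find_cuts(input[left], target[left], min_n, max_depth - 1),
    best,
    find_cuts(input[!left], target[!left], min_n, max_depth - 1))
}

# Synthetic data: the rate of 'yes' jumps when x crosses 0.5
set.seed(1)
x = runif(500)
y = ifelse(runif(500) &lt; ifelse(x &gt; 0.5, 0.8, 0.2), 'yes', 'no')

cuts = find_cuts(x, y, min_n = 50, max_depth = 2)
cuts
table(cut(x, breaks = c(-Inf, cuts, Inf), right = FALSE))
</code></pre>
<p>Every bin in the final table holds at least <code>min_n</code> rows, because a cut point is only accepted when both of its sides satisfy that condition.</p>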
<h2 id="learnmore">Learn more</h2>
<p>The <em>Data Science Live Book</em> covers some points related to this method:</p>
<ul>
<li><a href="https://livebook.datascienceheroes.com/data-preparation.html#discretizing_numerical_variables">Discretizing numerical variables</a>.</li>
<li>Sample size and accuracy trade-off, in the case of <a href="https://livebook.datascienceheroes.com/data-preparation.html#analysis-for-predictive-modeling">treating high-cardinality variables</a>.</li>
</ul>
<p>Want to grasp more about the information theory world? <a href="http://kevinmeurer.com/a-simple-guide-to-entropy-based-discretization/">A Simple Guide to Entropy-Based Discretization</a> by Kevin Meurer.</p>
<p>Leave any questions in the comments ;)</p>
<hr>
<p>Thanks for reading 🚀</p>
<p>Find me on <a href="https://twitter.com/pabloc_ds">Twitter</a> and <a href="https://www.linkedin.com/in/pcasas/">Linkedin</a>.</p>
<p>Want to learn more? 📗 <a href="http://livebook.datascienceheroes.com/">Data Science Live Book</a></p>
</div>]]></content:encoded></item><item><title><![CDATA[Feature Selection using Genetic Algorithms in R]]></title><description><![CDATA[From a gentle introduction to a practical solution, this is a post about feature selection using genetic algorithms in R.
]]></description><link>https://blog.datascienceheroes.com/feature-selection-using-genetic-algorithms-in-r/</link><guid isPermaLink="false">5c3788753f68cc3d8005d4b3</guid><category><![CDATA[R]]></category><category><![CDATA[genetic-algorithms]]></category><category><![CDATA[feature-selection]]></category><category><![CDATA[GA]]></category><dc:creator><![CDATA[Pablo Casas]]></dc:creator><pubDate>Tue, 15 Jan 2019 14:10:05 GMT</pubDate><media:content url="https://blog.datascienceheroes.com/content/images/2019/01/evolutionary_algortihm-1.gif" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.datascienceheroes.com/content/images/2019/01/evolutionary_algortihm-1.gif" alt="Feature Selection using Genetic Algorithms in R"><p>This is a post about feature selection using genetic algorithms in R, in which we will do a quick review about:</p>
<ul>
<li>What are genetic algorithms?</li>
<li>GA in ML?</li>
<li>What does a solution look like?</li>
<li>GA process and its operators</li>
<li>The fitness function</li>
<li>Genetic Algorithms in R!</li>
<li>Try it yourself</li>
<li>Relating concepts</li>
</ul>
<p><em>Animation source: &quot;Flexible Muscle-Based Locomotion for Bipedal Creatures&quot; - Thomas Geijtenbeek</em></p>
<h2 id="theintuitionbehind">The intuition behind</h2>
<p>Imagine a black box which can help us to decide over an <strong>unlimited number of possibilities</strong>, with a criterion such that we can find an acceptable solution (both in time and quality) to a problem that we formulate.</p>
<h2 id="whataregeneticalgorithms">What are genetic algorithms?</h2>
<p>Genetic Algorithms (GA) are a mathematical model inspired by Charles Darwin's famous idea of <em>natural selection</em>.</p>
<p>Natural selection preserves only the fittest individuals over the different generations.</p>
<p>Imagine a population of 100 rabbits in 1900. If we look at the population today, we are going to find other rabbits that are faster and more skillful at finding food than their ancestors.</p>
<h2 id="gainml">GA in ML</h2>
<p>In <strong>machine learning</strong>, one of the uses of genetic algorithms is to pick the right subset of variables in order to create a predictive model.</p>
<p>Picking the right subset of variables is a problem of <strong>combinatorics and optimization</strong>.</p>
<p>The advantage of this technique over others is that it allows the best solution to emerge from the best of the prior solutions: an evolutionary algorithm which improves the selection over time.</p>
<p>The idea of GA is to combine the different solutions <strong>generation after generation</strong> to extract the best <em>genes</em> (variables) from each one. That way it creates new and fitter individuals.</p>
<p>We can find other uses of GA, such as hyperparameter tuning, finding the maximum (or minimum) of a function, or searching for a suitable neural network architecture (Neuroevolution), among others.</p>
<h2 id="gainfeatureselection">GA in feature selection</h2>
<p>Every possible solution of the GA, that is, a set of selected variables (a <em>single</em> 🐇), is <strong>considered as a whole</strong>; the GA will not rank variables individually against the target.</p>
<p>And this is important because we already know that <a href="https://livebook.datascienceheroes.com/selecting-best-variables.html#variables-work-in-groups">variables work in groups</a>.</p>
<h2 id="whatdoesasolutionlooklike">What does a solution look like?</h2>
<p>Keeping it simple for the example, imagine we have a total of 6 variables.</p>
<p>One solution can be picking 3 variables, let's say: <code>var2</code>, <code>var4</code> and <code>var5</code>.</p>
<p>Another solution can be: <code>var1</code> and <code>var5</code>.</p>
<p>These solutions are the so-called <strong>individuals</strong> or <strong>chromosomes</strong> in a population. They are possible solutions to our problem.</p>
<img src="https://blog.datascienceheroes.com/content/images/2019/01/chromose-population-gene.png" alt="Feature Selection using Genetic Algorithms in R" width="300px">
<p><em>Credit image: Vijini Mallawaarachchi</em></p>
<p>From the image, solution 3 can be expressed as a binary vector: <code>c(1,0,1,0,1,1)</code>. Each <code>1</code> indicates that the solution contains that variable. In this case: <code>var1</code>, <code>var3</code>, <code>var5</code>, <code>var6</code>.</p>
<p>Solution 4, in turn, is: <code>c(1,1,0,1,1,0)</code>.</p>
<p>Each position in the vector is a <strong>gene</strong>.</p>
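<p>In R, going from a chromosome back to the variable names is a simple logical index. The sketch below uses the two vectors above; the <code>var_names</code> vector and the random population are made up for illustration:</p>
<pre><code class="language-r">var_names = paste0('var', 1:6)

solution_3 = c(1,0,1,0,1,1)
solution_4 = c(1,1,0,1,1,0)

# Variables selected by each chromosome
var_names[solution_3 == 1]   # var1, var3, var5, var6
var_names[solution_4 == 1]   # var1, var2, var4, var5

# A random initial population: 5 individuals (rows) x 6 genes (columns)
set.seed(123)
population = matrix(rbinom(5 * 6, size = 1, prob = 0.5),
                    nrow = 5, dimnames = list(NULL, var_names))
population
</code></pre>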
<h2 id="gaprocessanditsoperators">GA process and its operators</h2>
<img src="https://blog.datascienceheroes.com/content/images/2019/01/genetics_algortihms_workflow-1.png" alt="Feature Selection using Genetic Algorithms in R" width="600px">
<p>The underlying idea of a GA is to generate some random possible solutions (called <code>population</code>), which represent different variables, to then combine the best solutions in an iterative process.</p>
<p>This combination follows the basic GA operations, which are: selection, mutation and cross-over.</p>
<ul>
<li><strong>Selection</strong>: Pick the fittest individuals in a generation (i.e., the solutions providing the highest ROC).</li>
<li><strong>Cross-over</strong>: Create 2 new individuals based on the genes of two solutions. These children will appear in the next generation.</li>
<li><strong>Mutation</strong>: Change a gene randomly in the individual (e.g., flip a <code>0</code> to <code>1</code>).</li>
</ul>
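<p>The three operators can be sketched in a few lines of base R. This is a simplified illustration, not what the <code>GA</code> package does internally: the fitness values are made up, selection here simply keeps the top 2 individuals, and the cross-over is a uniform one (each gene is taken from either parent at random).</p>
<pre><code class="language-r">set.seed(42)
n_genes = 6

# Two parent solutions (binary chromosomes)
parent_1 = c(1,0,1,0,1,1)
parent_2 = c(1,1,0,1,1,0)

# Selection: keep the individuals with the highest fitness
population = rbind(parent_1, parent_2, c(0,0,1,0,0,1), c(1,1,1,1,1,1))
fitness = c(0.30, 0.25, 0.10, 0.16)   # made-up values for illustration
selected = population[order(fitness, decreasing = TRUE)[1:2], ]

# Cross-over (uniform): each gene of a child comes from one parent or the other
mask = rbinom(n_genes, size = 1, prob = 0.5) == 1
child_1 = ifelse(mask, parent_1, parent_2)
child_2 = ifelse(mask, parent_2, parent_1)

# Mutation: flip one randomly chosen gene
mutate = function(chromosome) {
  pos = sample(length(chromosome), 1)
  chromosome[pos] = 1 - chromosome[pos]
  chromosome
}
mutated = mutate(child_1)
</code></pre>
<p>In a real run, mutation is applied with a small probability (see <code>pmutation</code> in the code further below) rather than to every child.</p>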
<p>The idea is that in each generation we will find better individuals, like a faster rabbit.</p>
<p>I recommend the <a href="https://towardsdatascience.com/introduction-to-genetic-algorithms-including-example-code-e396e98d8bf3">post of Vijini Mallawaarachchi</a> about how a genetic algorithm works.</p>
<p>These basic operations allow the algorithm to change the possible solutions by combining them in a way that maximizes the objective.</p>
<h2 id="thefitnessfunction">The fitness function</h2>
<p>This objective maximization consists, for example, of keeping the solution that maximizes the area under the ROC curve. This is defined in the <em>fitness function</em>.</p>
<p>The fitness function takes a possible solution (or chromosome, if you want to sound more sophisticated), and <em>somehow</em> evaluates the effectiveness of the selection.</p>
<p>Normally, the fitness function takes the binary vector <code>c(1,1,0,0,0,0)</code>, creates, for example, a random forest model with <code>var1</code> and <code>var2</code>, and returns the fitness value (ROC).</p>
<p>The fitness value calculated in this code is: <code>ROC value / number of variables</code>. By doing this, the algorithm penalizes the solutions with a large number of variables, similar to the idea of the <a href="https://en.wikipedia.org/wiki/Akaike_information_criterion">Akaike information criterion</a>, or AIC.</p>
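<p>As a rough sketch of what such a function can look like, the code below computes <code>ROC value / number of variables</code> in base R. It is not the <code>custom_fitness</code> function from the repository: the names <code>auc</code> and <code>sketch_fitness</code> are hypothetical, a logistic regression is swapped in for the random forest to avoid extra dependencies, the ROC is measured on the training rows (no sampling), and the <code>iris</code> subset is only a stand-in data set.</p>
<pre><code class="language-r"># Area under the ROC curve via the rank-sum formula; labels is a 0/1 vector
auc = function(scores, labels) {
  r = rank(scores)
  n_pos = sum(labels == 1)
  n_neg = sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical fitness: ROC of a logistic regression / number of variables
sketch_fitness = function(vars, data_x, data_y) {
  selected = colnames(data_x)[vars == 1]
  if (length(selected) == 0) return(0)            # empty solutions get the worst fitness
  d = data.frame(data_x[, selected, drop = FALSE], target = data_y)
  model = glm(target ~ ., data = d, family = binomial)
  roc_value = auc(fitted(model), as.integer(data_y) - 1)
  roc_value / length(selected)
}

# Stand-in data: two iris species, four candidate variables
two_species = iris[iris$Species != 'setosa', ]
data_x = two_species[, 1:4]
data_y = droplevels(two_species$Species)

sketch_fitness(c(0,0,0,1), data_x, data_y)   # only Petal.Width
sketch_fitness(c(1,1,1,1), data_x, data_y)   # all four variables
</code></pre>
<p>Since the ROC value cannot exceed 1, a solution with four variables can never score above 0.25 here, which is how the penalty favors small subsets.</p>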
<img src="https://blog.datascienceheroes.com/content/images/2019/01/homer_ga.png" width="500px" alt="Feature Selection using Genetic Algorithms in R">
<h2 id="geneticsalgorithmsinr">Genetic Algorithms in R! 🐛</h2>
<p>My intention is to provide you with clean code so you can understand what's behind it, while at the same time being able to try new approaches, like modifying the fitness function. This is a crucial point.</p>
<p>To use it on your own data set, make sure <code>data_x</code> (data frame) and <code>data_y</code> (factor) are compatible with the <code>custom_fitness</code> function.</p>
<p>The main library is <code>GA</code>, developed by Luca Scrucca. See <a href="https://cran.r-project.org/web/packages/GA/vignettes/GA.html">here</a> the vignette with examples.</p>
<p>📣 <strong>Important</strong>: The following code is incomplete. <strong><a href="https://github.com/pablo14/genetic-algorithm-feature-selection">Clone the repository</a></strong> to run the example.</p>
<pre><code class="language-r"># data_x: input data frame
# data_y: target variable (factor)

# GA parameters
param_nBits=ncol(data_x)
col_names=colnames(data_x)

# Executing the GA 
ga_GA_1 = ga(fitness = function(vars) custom_fitness(vars = vars, 
                                                     data_x =  data_x, 
                                                     data_y = data_y, 
                                                     p_sampling = 0.7), # custom fitness function
             type = &quot;binary&quot;, # optimization data type
             crossover=gabin_uCrossover,  # cross-over method
             elitism = 3, # best N indiv. to pass to next iteration
             pmutation = 0.03, # mutation rate prob
             popSize = 50, # the number of individuals/solutions
             nBits = param_nBits, # total number of variables
             names=col_names, # variable name
             run=5, # max iter without improvement (stopping criteria)
             maxiter = 50, # total runs or generations
             monitor=plot, # plot the result at each iteration
             keepBest = TRUE, # keep the best solution at the end
             parallel = TRUE, # allow parallel processing
             seed=84211 # for reproducibility purposes
)
</code></pre>
<pre><code class="language-r"># Checking the results
summary(ga_GA_1)
</code></pre>
<pre><code class="language-r">── Genetic Algorithm ─────────────────── 

GA settings: 
Type                  =  binary 
Population size       =  50 
Number of generations =  50 
Elitism               =  3 
Crossover probability =  0.8 
Mutation probability  =  0.03 

GA results: 
Iterations             = 17 
Fitness function value = 0.2477393 
Solution = 
     radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean
[1,]           0            1              0         0               0                1
     concavity_mean concave points_mean symmetry_mean fractal_dimension_mean  ... 
[1,]              0                   0             0                      0      
     symmetry_worst fractal_dimension_worst
[1,]              0                       0
</code></pre>
<pre><code class="language-r"># Following line will return the variable names of the final and best solution
best_vars_ga=col_names[ga_GA_1@solution[1,]==1]

# Checking the variables of the best solution...
best_vars_ga
</code></pre>
<pre><code class="language-r">[1] &quot;texture_mean&quot;     &quot;compactness_mean&quot; &quot;area_worst&quot;       &quot;concavity_worst&quot; 
</code></pre>
<img src="https://blog.datascienceheroes.com/content/images/2019/01/GA_library_R.gif" alt="Feature Selection using Genetic Algorithms in R" width="450px">
<ul>
<li>Blue dot: Population fitness average</li>
<li>Green dot: Best fitness value</li>
</ul>
<p>Note: Don't expect the result that fast 😅</p>
<p>Now we calculate the accuracy based on the best selection!</p>
<pre><code class="language-r">get_accuracy_metric(data_tr_sample = data_x, target = data_y, best_vars_ga)
</code></pre>
<pre><code class="language-r">[1] 0.9508279
</code></pre>
<p>The accuracy is around 95.08%, while the ROC value is close to 0.99 (ROC = fitness value * number of variables = 0.2477 * 4, check the fitness function).</p>
<h2 id="analyzingtheresults">Analyzing the results</h2>
<p>I don't like to analyze the accuracy without the cutpoint (<a href="https://livebook.datascienceheroes.com/model-performance.html#scoring_data">Scoring Data</a>), but it's useful to compare with the results of this <a href="https://www.kaggle.com/kanncaa1/feature-selection-and-data-visualization">Kaggle post</a>.</p>
<p>The author got a similar accuracy result using recursive feature elimination, or RFE, based on 5 variables, while our solution stays with 4.</p>
<h2 id="tryityourself">Try it yourself</h2>
<p>Try a new fitness function. Some solutions still provide a large number of variables, so you can try dividing by the square of the number of variables.</p>
<p>Another thing to try is changing the algorithm used to get the ROC value, or even changing the metric.</p>
<p>Some configurations take a long time to run. Balance classes before modeling and play with the <code>p_sampling</code> parameter. Sampling techniques can have a big impact on models. Check the <a href="https://blog.datascienceheroes.com/sample-size-and-class-balance-on-model-performance/">Sample size and class balance on model performance</a> post for more info.</p>
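<p>One way to read the suggestion of squaring the number of variables is to divide the ROC value by the squared count. The two functions below are hypothetical names, and the ROC values are made up, just to show how the stronger penalty changes which solution wins:</p>
<pre><code class="language-r">fitness_linear  = function(roc_value, n_vars) roc_value / n_vars
fitness_squared = function(roc_value, n_vars) roc_value / n_vars^2

# Made-up candidates: A = 3 variables with ROC 0.99, B = 2 variables with ROC 0.60
fitness_linear(c(A = 0.99, B = 0.60), c(3, 2))    # A = 0.33, B = 0.30: A wins
fitness_squared(c(A = 0.99, B = 0.60), c(3, 2))   # A = 0.11, B = 0.15: B wins
</code></pre>
<p>The squared penalty gives smaller solutions, but as the example shows it can also prefer a clearly worse model, so it is worth checking the ROC of the winning solution.</p>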
<p>How about changing the rate of mutation or elitism? Or trying other cross-over methods?</p>
<p>Increase the <code>popSize</code> to test more possible solutions at the same time (at a time cost).</p>
<p>Feel free to share any insights or ideas to improve the selection.</p>
<p><strong><a href="https://github.com/pablo14/genetic-algorithm-feature-selection">Clone the repository</a></strong> to run the example.</p>
<h2 id="relatingconcepts">Relating concepts</h2>
<p>There is a parallel between GA and Deep Learning: the concept of iteration and improvement over time is similar.</p>
<p>I added the <code>p_sampling</code> parameter to speed things up, and it usually accomplishes its goal. It is similar to the <em>batch</em> concept used in Deep Learning. Another parallel is between the GA parameter <code>run</code> and the <em>early stopping</em> criterion in neural network training.</p>
<p>But the biggest similarity is that both techniques come from <strong>observing nature</strong>. In both cases, humans observed how neural networks and genetics work, and created a simplified mathematical model that imitates their behavior. Nature has millions of years of evolution, why not try to imitate it? 🌱</p>
<p>--</p>
<p>I tried to be brief about GA, but if you have any specific question on this vast topic, please leave it in the comments 🙋 🙋‍♂</p>
<p><em>And if I didn't motivate you enough to study GA, check this project which is based on Neuroevolution:</em></p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/qv6UVOQ0F44" frameborder="0"></iframe>
<hr>
<p>Thanks for reading 🚀</p>
<p>Find me on <a href="https://twitter.com/pabloc_ds">Twitter</a> and <a href="https://www.linkedin.com/in/pcasas/">Linkedin</a>.<br>
<a href="https://blog.datascienceheroes.com/">More blog posts</a>.</p>
<p>Want to learn more? 📗 <a href="http://livebook.datascienceheroes.com/">Data Science Live Book</a></p>
</div>]]></content:encoded></item><item><title><![CDATA[Integrating R and Telegram]]></title><description><![CDATA[Get notify when an R script finishes on Telegram.]]></description><link>https://blog.datascienceheroes.com/get-notify-when-an-r-script-finishes-on-telegram/</link><guid isPermaLink="false">5bd47ff83f68cc3d8005d4a1</guid><category><![CDATA[telegram]]></category><category><![CDATA[R]]></category><dc:creator><![CDATA[Pablo Casas]]></dc:creator><pubDate>Wed, 07 Nov 2018 14:08:12 GMT</pubDate><media:content url="https://blog.datascienceheroes.com/content/images/2018/11/Screen-Shot-2018-11-07-at-11.04.09.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.datascienceheroes.com/content/images/2018/11/Screen-Shot-2018-11-07-at-11.04.09.png" alt="Integrating R and Telegram"><p>Hi there!</p>
<p>tl;dr: Some models (deep learning) take a long time to finish. Even some data preparation scripts do. We can be notified when the process ends by sending <a href="https://telegram.org/">Telegram</a> messages from R.</p>
<h2 id="getnotifiedbytelegrambot">Get notified by Telegram bot</h2>
<p>This section is entirely based on the documentation of the <a href="https://ebeneditos.github.io/telegram.bot/">telegram.bot</a> package, by Ernest Benedito. Please visit the site to make use of the full capabilities of this package.</p>
<p>The idea of getting notified by Telegram is that we can see the notification either on our cellphone or in the <a href="https://web.telegram.org">web version</a>.</p>
<h3 id="step1createabot">Step 1: Create a bot</h3>
<p>Find <code>@BotFather</code> on Telegram. Send the message <code>/start</code>, then <code>/newbot</code>, and follow the instructions.</p>
<p>Save the bot token and never share it publicly.</p>
<h3 id="step2setupthebot">Step 2: Set-up the bot</h3>
<p>After your bot is created, send it the message <code>/start</code>. The bot is now configured!</p>
<h3 id="step3useitwithr">Step 3: Use it with R</h3>
<p>Put the bot token in the <code>.Renviron</code> file:</p>
<pre><code class="language-r">user_renviron &lt;- path.expand(file.path(&quot;~&quot;, &quot;.Renviron&quot;))
file.edit(user_renviron) 
</code></pre>
<p>This should look something like this:</p>
<img src="https://blog.datascienceheroes.com/content/images/2018/10/token.png" width="500px" alt="Integrating R and Telegram">
<p>Now restart R.</p>
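<p>For reference, the entry in <code>.Renviron</code> is a single line whose name ends with the bot name. The sketch below assumes the bot is called <code>arbot_bot</code>, as in the code that follows; the token value is a placeholder:</p>
<pre><code class="language-r"># In ~/.Renviron (one line, no quotes):
# R_TELEGRAM_BOT_arbot_bot=PASTE_YOUR_TOKEN_HERE

# After restarting R, confirm the variable is set without printing the token itself
nchar(Sys.getenv('R_TELEGRAM_BOT_arbot_bot')) &gt; 0
</code></pre>
<p>If this returns <code>FALSE</code>, the file was probably not saved in your home directory or R was not restarted.</p>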
<pre><code class="language-r"># install.packages(&quot;telegram.bot&quot;)
library(telegram.bot)

# Initiate the bot session using the token from the environment variable.
bot = Bot(token = bot_token('arbot_bot'))

# The first time, you will need the chat id (which is the chat where you will get the notifications)
updates = bot$getUpdates()
</code></pre>
<pre><code class="language-r">&gt; updates
  update_id message.message_id message.from.id message.from.is_bot message.from.first_name message.from.last_name
1 639401623                  1       174860321               FALSE            admin                 admin
2 639401624                  2       174860321               FALSE            admin                 admin
  message.from.language_code message.chat.id message.chat.first_name message.chat.last_name message.chat.type message.date
1                      en-US       174860321            admin                 admin           private   1540571205
2                      en-US       174860321            admin                 admin           private   1540571208
  message.text  message.entities
1       /start 0, 6, bot_command
2        hello              NULL
</code></pre>
<p><strong>Time to use in the R workflow!</strong> We will send a test message and a plot:</p>
<p>Note 1: <code>chat_id</code>=<code>message.chat.id</code>.<br>
Note 2: the token is read from the environment variable <code>R_TELEGRAM_BOT_{the name of your bot}</code>.</p>
<pre><code class="language-r"># Sending text
message_to_bot=sprintf('Process finished - Accuracy: %s', 0.99)

bot$sendMessage(chat_id = 174860321, text = message_to_bot)

# Sending image (we need to save it first)
library(ggplot2)
my_plot=ggplot(mtcars, aes(x=mpg))  + geom_histogram(bins = 5)
ggplot2::ggsave(&quot;my_plot.png&quot;, my_plot)

bot$sendPhoto(chat_id = 174860321, photo = 'my_plot.png')
</code></pre>
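<p>In a real workflow it can be handy to be notified of failures too. The helper below is a hypothetical sketch (the name <code>notify_when_done</code> is made up) that wraps any expression and reports either the elapsed time or the error message, using only <code>bot$sendMessage</code> from the code above. The <code>chat_id</code> is the one from this example; replace it with your own.</p>
<pre><code class="language-r">library(telegram.bot)

bot = Bot(token = bot_token('arbot_bot'))
chat_id = 174860321 # replace with your own message.chat.id

# Hypothetical helper: run an expression and report success or failure
notify_when_done = function(expr, label) {
  started = Sys.time()
  tryCatch({
    result = expr   # lazy evaluation: the expression actually runs here
    elapsed = round(as.numeric(difftime(Sys.time(), started, units = 'mins')), 1)
    bot$sendMessage(chat_id = chat_id,
                    text = sprintf('%s finished in %s min', label, elapsed))
    result
  }, error = function(e) {
    bot$sendMessage(chat_id = chat_id,
                    text = sprintf('%s failed: %s', label, conditionMessage(e)))
    stop(e)         # re-raise so the script still stops on errors
  })
}

# Usage: wrap the slow step
fit = notify_when_done(lm(mpg ~ wt, data = mtcars), label = 'Linear model')
</code></pre>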
<p>The results on telegram web:</p>
<img src="https://blog.datascienceheroes.com/content/images/2018/10/Screen-Shot-2018-10-27-at-11.44.48.png" alt="Integrating R and Telegram" width="550px">
<p>Note: I also tested the <a href="https://github.com/lbraglia/telegram">telegram</a> package and it works. However, <a href="https://ebeneditos.github.io/telegram.bot/">telegram.bot</a> seems more complete due to the bot options.</p>
<p>Check the full list of options to interact with the bot 🤖.</p>
<br>
<h2 id="getnotifiedbysound">Get notified by sound</h2>
<p>Another way of getting notified is by producing a sound: 🔔 <em>beep</em>!</p>
<pre><code class="language-r"># install.packages(&quot;beepr&quot;)
library(beepr)

## do some stuff, and...

beep()
beep()
</code></pre>
<img src="https://blog.datascienceheroes.com/content/images/2018/10/Screen-Shot-2018-10-27-at-12.14.42.png" width="300px" alt="Integrating R and Telegram">
<hr>
<p>Thanks for reading 🚀</p>
<p><a href="https://blog.datascienceheroes.com/">Blog</a> | <a href="https://www.linkedin.com/in/pcasas/">Linkedin</a> | <a href="https://twitter.com/pabloc_ds">Twitter</a> | 📗 <a href="http://livebook.datascienceheroes.com/">Data Science Live Book</a></p>
</div>]]></content:encoded></item><item><title><![CDATA[How to apply a function to a matrix/tibble]]></title><description><![CDATA[How to apply a function to a matrix/tibble]]></description><link>https://blog.datascienceheroes.com/how-to-apply-a-function-to-a-matrix-tibble/</link><guid isPermaLink="false">5baa7d513f68cc3d8005d493</guid><category><![CDATA[tibble]]></category><category><![CDATA[rstats]]></category><category><![CDATA[R]]></category><dc:creator><![CDATA[Pablo Casas]]></dc:creator><pubDate>Tue, 25 Sep 2018 18:27:33 GMT</pubDate><media:content url="https://blog.datascienceheroes.com/content/images/2018/09/Screen-Shot-2018-09-25-at-15.25.24.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://blog.datascienceheroes.com/content/images/2018/09/Screen-Shot-2018-09-25-at-15.25.24.png" alt="How to apply a function to a matrix/tibble"><p>Scenario: we got a table of id-value, and a matrix/tibble that contains the id, and we need the labels.</p>
<p>It may be useful when predicting the Key (or Ids) in a classification model (like in Keras), and we need the labels as the final output.</p>
<p>There are two interesting things:</p>
<ul>
<li>The usage of apply based on column and rows at the same time.</li>
<li>The creation of an empty tibble and how to fill it (append columns)</li>
</ul>
<h1 id="howtoapplyafunctiontoamatrixtibble">How to apply a function to a matrix/tibble</h1>
<pre><code class="language-r">library(tidyverse)
# mapping table (id-value)
map_table=tibble(id=c(1,2,3), 
                 value=c(&quot;a&quot;, &quot;b&quot;, &quot;c&quot;)
                 )

map_table
</code></pre>
<pre><code class="language-r">## # A tibble: 3 x 2
##      id value
##   &lt;dbl&gt; &lt;chr&gt;
## 1     1 a    
## 2     2 b    
## 3     3 c
</code></pre>
<pre><code class="language-r"># given a key, return the label
get_label &lt;- function(x) 
{
  res=filter(map_table, id==x)$value
  return(res)
}

# the data to get the label
X_data=tibble(v1=c(1,2,3), 
              v2=c(2,2,2),
              v3=c(3,2,1)
              )

X_data
</code></pre>
<pre><code class="language-r">## # A tibble: 3 x 3
##      v1    v2    v3
##   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1     1     2     3
## 2     2     2     2
## 3     3     2     1
</code></pre>
<h2 id="option1asmatrix">Option 1: as matrix</h2>
<pre><code class="language-r">mat_res=apply(X_data, 1:2, get_label)

## Checking...
mat_res
</code></pre>
<pre><code class="language-r">##      v1  v2  v3 
## [1,] &quot;a&quot; &quot;b&quot; &quot;c&quot;
## [2,] &quot;b&quot; &quot;b&quot; &quot;b&quot;
## [3,] &quot;c&quot; &quot;b&quot; &quot;a&quot;
</code></pre>
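<p>The <code>1:2</code> passed as the second argument is what makes <code>apply</code> work cell by cell. A minimal sketch of the three margin settings, using a toy numeric matrix <code>m</code> that is not part of the example above:</p>
<pre><code class="language-r"># toy matrix, made up for this sketch: columns are (1,2), (3,4), (5,6)
m &lt;- matrix(1:6, nrow = 2)

apply(m, 1, sum)   # MARGIN = 1: one call per row, gives 9 12
apply(m, 2, sum)   # MARGIN = 2: one call per column, gives 3 7 11

# MARGIN = 1:2: one call per cell, the result keeps the shape of m
apply(m, 1:2, function(x) x * 10)
</code></pre>
<p>When the input is a tibble or data frame, <code>apply</code> first converts it with <code>as.matrix</code>, which is why Option 1 returns a character matrix rather than a tibble.</p>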
<h2 id="option2astibbleusingfor">Option 2: as tibble (using 'for')</h2>
<pre><code class="language-r"># create a one-column tibble of NAs with as many rows as X_data
tib_res=tibble(V1=rep(NA, nrow(X_data))) 
for(i in 1:ncol(X_data))
{
  vec=X_data[,i]
  vec_lbl=sapply(t(vec), get_label) # if X_data is a matrix, no need to transpose with t()
  tib_res[[paste0(&quot;V&quot;, i)]]=vec_lbl # explicit names V1..V3; the first pass overwrites the dummy NA column
}

## Checking...
tib_res
</code></pre>
<pre><code class="language-r">## # A tibble: 3 x 3
##   V1    V2    V3   
##   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
## 1 a     b     c    
## 2 b     b     b    
## 3 c     b     a
</code></pre>
<h2 id="option3astibbleusingmutate_all">Option 3: as tibble (using 'mutate_all')</h2>
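<pre><code class="language-r"># get_label() expects a single key, so map it over each column element by element;
# passing get_label directly compares the whole column with map_table$id and mislabels v3
tib_res_2=mutate_all(X_data, .funs = ~ map_chr(.x, get_label))
tib_res_2
</code></pre>
<pre><code class="language-r">## # A tibble: 3 x 3
##   v1    v2    v3   
##   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
## 1 a     b     c    
## 2 b     b     b    
## 3 c     b     a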
</code></pre>
<h2 id="finally">Finally...</h2>
<p>Option 2, to my surprise, is faster than Option 1.<br>
I didn't use <code>add_column</code> because the first dummy <code>NA</code> column has to be replaced.<br>
Other approaches may include dictionaries (for example, a named vector used for lookups).</p>
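<p>The speed comparison is easy to repeat on your own machine. Below is a rough sketch with base R's <code>system.time</code>. The names <code>option_1</code>, <code>option_2</code> and <code>n_rep</code> are made up for this sketch, Option 2 is lightly adapted (columns are named explicitly), and the timings will depend on data size and package versions.</p>
<pre><code class="language-r">library(tidyverse)

# same objects as in the post, repeated so the snippet runs on its own
map_table &lt;- tibble(id = c(1, 2, 3), value = c(&quot;a&quot;, &quot;b&quot;, &quot;c&quot;))
get_label &lt;- function(x) filter(map_table, id == x)$value
X_data &lt;- tibble(v1 = c(1, 2, 3), v2 = c(2, 2, 2), v3 = c(3, 2, 1))

# Option 1 wrapped in a function (hypothetical helper name)
option_1 &lt;- function() apply(X_data, 1:2, get_label)

# Option 2 wrapped in a function (hypothetical helper name)
option_2 &lt;- function() {
  tib_res &lt;- tibble(V1 = rep(NA, nrow(X_data)))
  for (i in 1:ncol(X_data)) {
    tib_res[[paste0(&quot;V&quot;, i)]] &lt;- sapply(X_data[[i]], get_label)
  }
  tib_res
}

n_rep &lt;- 100  # arbitrary number of repetitions
system.time(for (k in 1:n_rep) option_1())
system.time(for (k in 1:n_rep) option_2())
</code></pre>
<p>With only nine cells the difference is small, so a larger <code>X_data</code> gives a more meaningful comparison.</p>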
<p>Any improvement in the code is welcome.</p>
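<p>One possible take on the dictionary idea is a named vector: the ids become the names and the labels the values. This is only a sketch; <code>dict</code>, <code>get_label_dict</code> and <code>tib_res_3</code> are hypothetical names that do not appear in the code above.</p>
<pre><code class="language-r">library(tidyverse)

# same data as in the post, repeated so the snippet runs on its own
map_table &lt;- tibble(id = c(1, 2, 3), value = c(&quot;a&quot;, &quot;b&quot;, &quot;c&quot;))
X_data &lt;- tibble(v1 = c(1, 2, 3), v2 = c(2, 2, 2), v3 = c(3, 2, 1))

# named vector acting as the dictionary: names are the ids, values are the labels
dict &lt;- setNames(map_table$value, map_table$id)

# vectorised lookup by name; unname() drops the id names from the result
get_label_dict &lt;- function(x) unname(dict[as.character(x)])

tib_res_3 &lt;- mutate_all(X_data, get_label_dict)
tib_res_3
</code></pre>
<p>Because the lookup handles a whole column at once, no loop or <code>sapply</code> is needed, and ids missing from <code>map_table</code> come back as <code>NA</code>.</p>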
<hr>
<p>Thanks for reading 🚀</p>
<p><a href="https://blog.datascienceheroes.com/">Blog</a> | <a href="https://www.linkedin.com/in/pcasas/">LinkedIn</a> | <a href="https://twitter.com/pabloc_ds">Twitter</a> | 📗 <a href="http://livebook.datascienceheroes.com/">Data Science Live Book</a></p>
</div>]]></content:encoded></item></channel></rss>