Hi there! This post is an experiment combining the results of **t-SNE** with two well-known clustering techniques: **k-means** and **hierarchical** clustering. This is the practical section, in **R**.

This post also explores the intersection of concepts like dimension reduction, clustering analysis, data preparation, PCA, HDBSCAN, k-NN, SOM, deep learning... and Carl Sagan!

*First published at: http://blog.datascienceheroes.com/playing-with-dimensions-from-clustering-pca-t-sne-to-carl-sagan*

For those who don't know the **t-SNE** technique (official site), it's a projection -or dimension reduction- technique, similar in some aspects to Principal Component Analysis (PCA), used to visualize N variables in 2 dimensions (for example).

When the t-SNE output is poor, Laurens van der Maaten (t-SNE's author) says:

> As a sanity check, try running PCA on your data to reduce it to two dimensions. If this also gives bad results, then maybe there is not very much nice structure in your data in the first place. If PCA works well but t-SNE doesn't, I am fairly sure you did something wrong.

In my experience, doing PCA on dozens of variables with:

- some extreme values
- skewed distributions
- several dummy variables

doesn't lead to good visualizations.

Check out this example comparing the two methods:

Source: Clustering in 2-dimension using tsne

Makes sense, doesn’t it?

Since one of the **t-SNE** results is a two-dimensional matrix, where each dot represents an input case, we can apply clustering and group the cases according to their distance in this **2-dimension map**, just like a geographic map projects our 3-dimensional world onto two dimensions (paper).

**t-SNE** puts similar cases together, handling the non-linearities of data very well. After using the algorithm on several data sets, I believe that in some cases it creates *circular shapes*, like islands, where the cases inside are similar.

However, I didn't see this effect in the live demonstration from the Google Brain team: How to Use t-SNE Effectively. Perhaps it's because of the nature of the input data: only 2 variables.

According to its FAQ, t-SNE doesn't work very well with the *swiss roll* -toy- data. Nonetheless, it's a stunning example of how a 3-dimensional surface (or **manifold**) with a concrete spiral shape is unfolded like paper thanks to a dimension reduction technique.

The image is taken from this paper where they used the manifold sculpting technique.

**t-SNE** helps make the clustering more accurate, because it converts data into a 2-dimensional space where dots lie in roughly circular shapes (which pleases k-means, since non-spherical shapes are one of its weak points when creating segments. More on this: K-means clustering is not a free lunch).

It's a sort of **data preparation** step before applying the clustering models.

```
library(caret)
library(Rtsne)
######################################################################
## The WHOLE post is in: https://github.com/pablo14/post_cluster_tsne
######################################################################
## Download data from: https://github.com/pablo14/post_cluster_tsne/blob/master/data_1.txt (url path inside the gitrepo.)
data_tsne=read.delim("data_1.txt", header = T, stringsAsFactors = F, sep = "\t")
## Rtsne function may take some minutes to complete...
set.seed(9)
tsne_model_1 = Rtsne(as.matrix(data_tsne), check_duplicates=FALSE, pca=TRUE, perplexity=30, theta=0.5, dims=2)
## getting the two dimension matrix
d_tsne_1 = as.data.frame(tsne_model_1$Y)
```

Different runs of `Rtsne` lead to different results, so more than likely you will not see exactly the same model as the one presented here.

According to the official documentation, `perplexity` is related to the importance of neighbors:

> *"It is comparable with the number of nearest neighbors k that is employed in many manifold learners."* *"Typical values for the perplexity range between 5 and 50."*

The object `tsne_model_1$Y` contains the X-Y coordinates (variables `V1` and `V2`) for each input case.

Plotting the t-SNE result:

```
## plotting the results without clustering
library(ggplot2)
ggplot(d_tsne_1, aes(x=V1, y=V2)) +
  geom_point(size=0.25) +
  guides(colour=guide_legend(override.aes=list(size=6))) +
  xlab("") + ylab("") +
  ggtitle("t-SNE") +
  theme_light(base_size=20) +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank()) +
  scale_colour_brewer(palette = "Set2")
```

And there are the famous "islands" 🏝️. At this point, we could do some clustering just by looking at the map... but let's try k-means and hierarchical clustering instead 😄. t-SNE's FAQ page suggests decreasing the perplexity parameter to avoid this; nonetheless, I didn't find it a problem in this result.

The next piece of code will create the **k-means** and **hierarchical** cluster models, and then assign to each input case the cluster number (1, 2 or 3) to which it belongs.

```
## keeping original data
d_tsne_1_original=d_tsne_1
## Creating k-means clustering model, and assigning the result to the data used to create the tsne
fit_cluster_kmeans=kmeans(scale(d_tsne_1), 3)
d_tsne_1_original$cl_kmeans = factor(fit_cluster_kmeans$cluster)
## Creating hierarchical cluster model, and assigning the result to the data used to create the tsne
fit_cluster_hierarchical=hclust(dist(scale(d_tsne_1)))
## setting 3 clusters as output
d_tsne_1_original$cl_hierarchical = factor(cutree(fit_cluster_hierarchical, k=3))
```

Now time to plot the result of each cluster model, based on the t-SNE map.

```
plot_cluster=function(data, var_cluster, palette)
{
  ggplot(data, aes_string(x="V1", y="V2", color=var_cluster)) +
    geom_point(size=0.25) +
    guides(colour=guide_legend(override.aes=list(size=6))) +
    xlab("") + ylab("") +
    ggtitle("") +
    theme_light(base_size=20) +
    theme(axis.text.x=element_blank(),
          axis.text.y=element_blank(),
          legend.direction = "horizontal",
          legend.position = "bottom",
          legend.box = "horizontal") +
    scale_colour_brewer(palette = palette)
}

plot_k=plot_cluster(d_tsne_1_original, "cl_kmeans", "Accent")
plot_h=plot_cluster(d_tsne_1_original, "cl_hierarchical", "Set1")

## and finally: putting the plots side by side with the gridExtra lib...
library(gridExtra)
grid.arrange(plot_k, plot_h, ncol=2)
```

In this case, and based only on visual analysis, hierarchical clustering seems to have more *common sense* than k-means. Take a look at the following image:

*Note: dashed lines separating the clusters were drawn by hand*

In k-means, some points at the bottom-left corner are quite close to each other compared with the distances to other points inside the same cluster, yet they belong to different clusters. Illustrating it:

So we've got: the red arrow is shorter than the blue arrow...

*Note: Different runs may lead to different groupings. If you don't see this effect in that part of the map, search for it in another part.*

This effect doesn't happen in the hierarchical clustering. Clusters with this model seem more even. But what do you think?

It's not fair to compare k-means like that, though. The last analysis is based on the idea of **density clustering**. This technique is really cool for overcoming the pitfalls of simpler methods.

The **HDBSCAN** algorithm bases its process on densities.

Find the essence of each method by looking at this picture:

Surely you now understand the difference between them...

The last picture comes from Comparing Python Clustering Algorithms. Yes, Python, but it's the same for R. The package is largeVis. *(Note: install it with `install_github("elbamos/largeVis", ref = "release/0.2")`)*.

Quoting Luke Metz from a great post (Visualizing with t-SNE):

*Recently there has been a lot of hype around the term "deep learning". In most applications, these "deep" models can be boiled down to the composition of simple functions that embed from one high dimensional space to another. At first glance, these spaces might seem too large to think about or visualize, but techniques such as t-SNE allow us to start to understand what's going on inside the black box. Now, instead of treating these models as black boxes, we can start to visualize and understand them.*

A deep comment 👏.

Beyond this post, **t-SNE** has proven to be a really **great** general-purpose tool for reducing dimensionality. It can be used to explore the relationships inside the data by building clusters, or to analyze anomalous cases by inspecting the isolated points on the map.

Playing with dimensions is a key concept in data science and machine learning. The perplexity parameter is really similar to the *k* in the nearest neighbors algorithm (k-NN). Mapping data into 2 dimensions and then doing clustering? Hmmm, not new, buddy: Self-Organising Maps for Customer Segmentation.

When we select the best features to build a model, we're reducing the data's dimension. When we build a model, we are creating a function that describes the relationships in data... and so on...

Did you already know the general concepts behind k-NN and PCA? Well, this is one more step: just plug the cables into the brain and that's it. Learning general concepts gives us the opportunity to make these kinds of associations between all of these techniques. Rather than comparing programming languages, the power -in my opinion- lies in focusing on how data behaves, and how these techniques are -and can ultimately be- connected.

Explore the imagination with this video by **Carl Sagan**: Flatland and the 4th Dimension. A tale about the interaction of 3D objects with a 2D plane...

Data Science Live Book (open source): An intuitive and practical approach to data analysis, data preparation and machine learning, suitable for all ages! 🚀

This package relies heavily on dplyr for database abstraction; it theoretically works with any dplyr-compatible database, but may require some tuning for some of them.

The way you should use this app is to build your chart in `Sample mode`, and when you have the visualization you want, untick the sample mode, which goes to the database to fetch the complete dataset you need. The app does some tricks with dplyr to avoid over-querying data.

Scatterplot example:

Line chart example:

The app has to be configured by placing 2 files at the root of the project: `config.R` and `update_samples.R`. Example files using SQLite are provided in the examples folder.

Before using the Shiny app, you have to execute the script `update_samples.R`, which will download samples of all tables (this might take a while on big databases). If you only want to query a subset of your tables, modify this script so it only finds those tables.

This script should be rerun occasionally, depending on how much your database changes -maybe daily, weekly or monthly. Use your job scheduler or cron to execute it.

Also, the stack overflows easily because of the level of recursion used. On the server or machine where you deploy this, you should allow for big stack sizes; in my experience, the unlimited setting worked fine. (The following command works on Linux; find the equivalent for your operating system if you hit stack overflow errors.)

```
ulimit -s unlimited
```

This is a very preliminary release; a lot of things may be missing. Pull requests are welcome!

Some examples of missing features:

- Some advanced settings to control appearance have to be added
- Bar charts
- Histograms
- Faceting maybe?
- More database examples
- Bookmarking charts (It fails with the default bookmarker)

The repository from which you can download this application is:

https://github.com/sicarul/shiny-chart-builder

If you've never visited the Data Science Live Book...
here's the home page

This method works but has some issues. Sebastian Peyrott has written an excellent new blog post that explains how to add authentication to the open source edition of Shiny from scratch, using a node.js proxy and Nginx.

With this you'll be able to serve internal reports without going to an expensive solution or doing everything from scratch. You can read the blogpost at:

https://auth0.com/blog/adding-authentication-to-shiny-server/

Hi there! I decided to *almost* rewrite the model validation section, since it didn't reflect real-world scenarios.

Hopefully, in the two new chapters you will gain deeper knowledge of methodological aspects of model validation -classical cross-validation, bootstrapping, and going further into the **nature of the error**- and also learn how to take advantage of validation when data is **time dependent**.

There is a lot more to tell about model validation, but it's a kick start.

Coming soon, there will be an update on methodological aspects in **data preparation**.

If you've never visited the **#dslivebook**...
here's the home page

*First published at: http://blog.datascienceheroes.com/model-performance-in-data-science-live-book*

This update contains a new chapter -**scoring**- which is related to **model performance** and **model deployment**, used when predicting a binary outcome.

**Important**: To use the following updates, please update the funModeling package :)

`install.packages("funModeling")`

Also related to predictive modelling for binary outcome, there is a new chapter based on how to compare models using the **gain** and **lift charts**.

Link to the gain and lift chapter.

Finally, there is a new function, `freq`, which generates the common **frequency analysis** plus the table with the numbers.

*This function can run automatically over all the input data and export all the images at once*.

Link to the frequency function (it's at the bottom of the page).

If you've never visited the **#dslivebook**...
here's the home page

*First published at: http://blog.datascienceheroes.com/data-science-live-book-scoring-model-performance-profiling-update*

Hi! Well, finally here is the first release of this project: an **open source** book which will *hopefully* contain some useful resources for those who want to learn some data analysis/machine learning.

This release covers a little of data preparation, data profiling, selecting best variables (DataViz), assessing model performance, and coming soon a case study using decision trees.

*An intuitive and practical approach to data analysis, data preparation and machine learning, suitable for all ages!* 🚀

*First published at: http://blog.datascienceheroes.com/data-science-live-book-open-source*

Time series have **maximum** and **minimum** points as general patterns. Sometimes the noise present in them makes it hard to spot the general behavior.

In this post, we will **smooth** the time series -reducing the noise- to maximize the story the data has to tell us. Then, an easy formula will be applied to find and plot the max/min points and thus characterize the data.

```
# reading data sources, 2 time series
t1=read.csv("ts_1.txt")
t2=read.csv("ts_2.txt")
# plotting...
plot(t1$ts1, type = 'l')
plot(t2$ts2, type = 'l')
```

As you can see there are many peaks, but intuitively you can imagine a smoother line crossing through the middle of the points. This can be achieved by applying a **Seasonal and Trend decomposition using Loess** (STL).

```
# first create the time series object, with frequency = 50, and then apply the stl function.
stl_1=stl(ts(t1$ts1, frequency=50), "periodic")
stl_2=stl(ts(t2$ts2, frequency=50), "periodic")
```

*Important*: If you don't know the `frequency` beforehand, play a little with this parameter until you find a result you are comfortable with.

Creating the functions...

```
## local maxima: where the slope changes from positive to negative
ts_max<-function(signal)
{
  points_max=which(diff(sign(diff(signal)))==-2)+1
  return(points_max)
}

## local minima: same idea, applied to the inverted signal
ts_min<-function(signal)
{
  points_min=which(diff(sign(diff(-signal)))==-2)+1
  return(points_min)
}
```
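As a quick sanity check of that slope-change formula (not from the original post: just a sine wave over two full periods, which we know has exactly two maxima and two minima):

```r
# Sanity check on a signal with known extrema: sin over [0, 4*pi]
signal_test = sin(seq(0, 4*pi, by=0.1))

# local maxima: slope changes from positive to negative
points_max = which(diff(sign(diff(signal_test))) == -2) + 1
# local minima: same idea on the inverted signal
points_min = which(diff(sign(diff(-signal_test))) == -2) + 1

length(points_max)  # 2 maxima (near pi/2 and 5*pi/2)
length(points_min)  # 2 minima (near 3*pi/2 and 7*pi/2)
```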

```
trend_1=as.numeric(stl_1$time.series[,2])
max_1=ts_max(trend_1)
min_1=ts_min(trend_1)
## Plotting final results
plot(trend_1, type = 'l')
abline(v=max_1, col="red")
abline(v=min_1, col="blue")
```

With the line `stl_1$time.series[,2]` we are accessing the `trend` component of the time series. This is the smoothing method we will use, but there are others.

This first series has 3 maxima *(red lines)* and 2 minima *(blue lines)*, in the following places:

```
# Where the max points occur:
max_1
```

```
# Where the min points occur:
min_1
```

```
trend_2=as.numeric(stl_2$time.series[,2])
max_2=ts_max(trend_2)
min_2=ts_min(trend_2)
# create two aligned plots
par(mfrow=c(2,1))
## Plotting series 1
plot(trend_1, type = 'l')
abline(v=max_1, col="red")
abline(v=min_1, col="blue")
## Plotting series 2
plot(trend_2, type = 'l')
abline(v=max_2, col="red")
abline(v=min_2, col="blue")
```

Some **conclusions** from both plots:

- `Series 2` starts with a `min`, while `Series 1` starts with a `max`.
- `Series 1` has 3 `max` and 2 `min` points, just the opposite of the other series.

Why is this important? Because of the nature of the data, which is described in the next section.

`ts1` and `ts2` are two typical responses to a brain stimulus; in other words, what happens in the brain when a person looks at a picture, moves a finger, thinks of a particular thing, etc.: Electroencephalography.

Some studies in **neuroscience** focus on averaging several responses to one stimulus -for example, looking at one particular picture. They present a particular image to the person several times. Averaging all of these signals/time series, you get the **typical response**.

Then you can **predict** based on the similarity between this **typical response** and the **new image** (stimulus) that the person is looking at.
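That matching step can be sketched in a few lines of base R: correlate the new signal against each stored typical response and pick the best match. (This is an illustrative toy with made-up templates and names, not the actual method of any particular study.)

```r
# Hypothetical typical responses (templates) for two stimuli
t_axis = seq(0, 2*pi, length.out = 100)
templates = list(stimulus_A = sin(t_axis),
                 stimulus_B = cos(t_axis))

# A new, noisy response to classify
set.seed(42)
new_response = sin(t_axis) + rnorm(100, sd = 0.1)

# Similarity of the new response to each typical response
similarities = sapply(templates, function(tpl) cor(tpl, new_response))
predicted = names(which.max(similarities))
predicted  # "stimulus_A": the new response best matches the sine template
```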

It's important to get **when** the positive peaks occur. In this case they are: `P1`, `P2` and `P3`. The same goes for the negative ones.

Wiki: Event related potential.

*Note: It's a common practice to invert negative and positive values.*

Typically, the signal time length for this kind of study is **400ms**, thus 1 point per millisecond, just like the displayed plots. The amplitude is measured in **volts** *(actually micro-volts)*, the same unit of measurement used by the notebook you are using right now ;)

Reproduce all the analysis with
this repository.

Amazon's columnar database, Redshift, is a great companion for a lot of data science tasks: it allows fast processing of very big datasets with a familiar query language (SQL).

There are 2 ways to load data into Redshift. The classic one, using the `INSERT` statement, works, but it is highly inefficient when loading big datasets. The other one, recommended in Redshift's docs, consists of using the `COPY` statement.

One of the easiest ways to accomplish this, since we are already using Amazon's infrastructure, is to do a load from S3. S3 loading requires that you upload your data to Amazon S3 and then run a `COPY` statement specifying where your data is.

Also, because Redshift is a distributed database, the docs recommend splitting your file into a number of files that is a multiple of the number of slices in your database, so they can be loaded in parallel. They also let you gzip your files for a faster upload.
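The splitting step can be sketched in base R. This is not the package's internal code, just an illustration of chopping a data frame into N roughly equal parts (using the built-in `mtcars` data):

```r
# Split a data frame into n_parts chunks of (roughly) equal size,
# e.g. a multiple of the number of Redshift slices
split_for_upload = function(df, n_parts) {
  chunk_id = rep_len(seq_len(n_parts), nrow(df))  # round-robin assignment
  split(df, chunk_id)
}

parts = split_for_upload(mtcars, 4)
length(parts)             # 4 chunks
sum(sapply(parts, nrow))  # 32 rows in total, nothing lost
```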

Wait a second, now to upload a big dataset fast we have to:

- Create a table in Redshift with the same structure as my data frame
- Split the data into N parts
- Convert the parts into a format readable by Redshift
- Upload all the parts to Amazon S3
- Run the COPY statement on Redshift
- Delete the temporary files on Amazon S3

That does seem like a lot of work, but don't worry, I've got your back! I've created an R package which does exactly this: `redshiftTools`! The code is available at github.com/sicarul/redshiftTools.

To install the package, you'll need to do:

```
install.packages('devtools')
devtools::install_github("RcppCore/Rcpp")
devtools::install_github("rstats-db/DBI")
devtools::install_github("rstats-db/RPostgres")
install.packages("aws.s3", repos = c(getOption("repos"), "http://cloudyr.github.io/drat"))
devtools::install_github("sicarul/redshiftTools")
```

Afterwards, you'll be able to use it like this:

```
library("aws.s3")
library(RPostgres)
library(redshiftTools)
con <- dbConnect(RPostgres::Postgres(), dbname="dbname",
host='my-redshift-url.amazon.com', port='5439',
user='myuser', password='mypassword',sslmode='require')
rs_replace_table(my_data, dbcon=con, tableName='mytable', bucket="mybucket")
rs_upsert_table(my_other_data, dbcon=con, tableName = 'mytable', bucket="mybucket", keys=c('id', 'date'))
```

`rs_replace_table` truncates the target table and then loads it entirely from the data frame; only do this if you don't care about the data it currently holds. On the other hand, `rs_upsert_table` replaces rows which have coinciding keys, and inserts those that do not exist in the table.
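To make the upsert semantics concrete, here's a base-R sketch of what "replace rows with coinciding keys, insert the rest" means (purely illustrative; the function name and data are mine, not the package's implementation):

```r
# Upsert semantics in plain R: rows in `incoming` replace rows in
# `target` that share the same key values; all other rows are kept.
upsert_df = function(target, incoming, keys) {
  key_target   = do.call(paste, target[keys])
  key_incoming = do.call(paste, incoming[keys])
  rbind(target[!(key_target %in% key_incoming), , drop = FALSE], incoming)
}

target   = data.frame(id = 1:3, value = c("a", "b", "c"))
incoming = data.frame(id = c(2, 4), value = c("B", "d"))
result   = upsert_df(target, incoming, keys = "id")
result[order(result$id), ]  # ids 1..4; the value for id 2 is now "B"
```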

Please open an issue in Github if you find any issues. Have fun!


Good news! funModeling documentation evolved into an open source book! Please follow the link below

This release covers a little of data preparation, data profiling, selecting best variables (dataviz), assessing model performance, and more coming soon.

Sneak peek into the `funModeling` code on Github *(either for learning or to contribute; the code is not complex and it's commented)*, so it doesn't stay a "black-box".

Inspired by this Netflix post, I decided to write a post based on this topic using R.

There are several nice packages to achieve this goal; the one we're going to review is **AnomalyDetection**.

Download full -*and tiny*- R code of this post here.

The definition for abnormal, or outlier, is an element which **does not follow the behaviour of the majority**.

Data has noise; it's like a radio that doesn't have a good signal, where you end up listening to some background noise.

- The orange section could be **noise in data**, since it oscillates around a value without showing a defined pattern; in other words: white noise.
- *Are the red circles noise, or are they peaks of an undercover pattern?*

A good algorithm can detect abnormal points considering the inner noise and leaving it behind. The `AnomalyDetectionTs` function in the `AnomalyDetection` package can perform this task quite well.

In this example, data comes from the well-known Wikipedia, which offers an API to download from R the `daily page views` given any `{term + language}`.

In this case, we've got the page views for the term `fifa`, language `en`, from `2013-02-22` up to today.

After applying the algorithm, we can plot the original time series plus the **abnormal points** in which the page views were over the expected value.

The parameters in the algorithm are `max_anoms=0.01` (a fraction: at most `1%` of outlier points in the final result), and `direction="pos"` to detect anomalies above (not below) the expected value.

As a result, **8 anomalous dates** were detected. Additionally, the algorithm returns what the **expected value** would have been, and an extra calculation is performed to get this difference in terms of percentage, `perc_diff`.
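That extra calculation is just a percentage difference between the observed and expected values; a minimal sketch (the helper name and numbers are mine, not from the package):

```r
# Percentage difference between observed and expected page views
perc_diff = function(observed, expected) {
  round(100 * (observed - expected) / expected, 2)
}

perc_diff(observed = 45000, expected = 34000)  # 32.35: ~32% above expected
perc_diff(observed = 100, expected = 100)      # 0: exactly as expected
```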

*If you want to know more about the maths behind it, google: Generalized ESD and time series decomposition*

**Something went wrong:** something strange, since the 1st expected value is the same value as the series has (`34028` page views). As a matter of fact, `perc_diff` is 0 while it should be a really low number. However, the anomaly is well detected, and apparently the next ones are too. *If you know why, you can email me and share the knowledge* :)

The last plot shows a line indicating the **linear trend** over a specific period -clearly decreasing- and **two black circles**. It's interesting to note that these black points **were not** detected by the algorithm, because they are part of a decreasing tendency (noise, perhaps?).

A really nice shot by this algorithm, since the focus of detection is on **changes in general patterns**. Just take a look at the last detected point in that period: it was a peak that didn't follow the **decreasing pattern** (it occurred on `2014-07-12`).

These anomalies in the term `fifa` are correlated with the news: **the first group of anomalies** is related to the FIFA World Cup (around **Jun/Jul 2014**), and **the second group**, centered on **May 2015**, is related to the FIFA scandal.

In the LA Times a timeline about the scandal can be found, including two important dates -**May 27th and 28th**- which are two dates **found by the algorithm**.

There is a complete chapter in the Data Science Live Book covering **outlier treatment**; outliers can be seen as a kind of anomalous data. All the examples are in R, and the topic is covered from both perspectives, practical and theoretical.

Thanks for reading :)

Data Science Live Book (open source): An intuitive and practical approach to data analysis, data preparation and machine learning, suitable for all ages! 🚀

Big Data helps us analyze unstructured data (aka "text") with many techniques; in this post one of them is presented: Cosine Similarity.

There is also other analysts' work, scraping data from Twitter to spot airline complaints from passengers.

**Cosine similarity** is a technique to measure how similar two documents are, based on the words they have.

This link explains the concept very well, with an example which is replicated in R later in this post.

*Quick summary:* Imagine a document as a vector; you can build it just by counting word appearances. If you have two vectors, they will form an angle.

**If the documents have almost the same words, then the cosine of those vectors will be near to 1. Otherwise this score will be close to 0**.

I replicated the example in R:

*1) Julie loves me more than Linda loves me*

*2) Jane likes me more than Julie loves me*

Word counting per sentence:

```
sentence_1=c(2, 1, 0, 2, 0, 1, 1, 1)
sentence_2=c(2, 1, 1, 1, 1, 0, 1, 1)
crossprod(sentence_1, sentence_2)/sqrt(crossprod(sentence_1) * crossprod(sentence_2))
```

And the result is... `0.8215838`!
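The one-liner above can be wrapped into a small reusable helper (just a convenience sketch; the function name is mine):

```r
# Cosine similarity between two term-count vectors
cosine_sim = function(a, b) {
  as.numeric(crossprod(a, b) / sqrt(crossprod(a) * crossprod(b)))
}

sentence_1 = c(2, 1, 0, 2, 0, 1, 1, 1)
sentence_2 = c(2, 1, 1, 1, 1, 0, 1, 1)
round(cosine_sim(sentence_1, sentence_2), 7)  # 0.8215838, same as before

cosine_sim(c(1, 0), c(0, 1))  # 0: no words in common
cosine_sim(c(1, 2), c(2, 4))  # 1: same word proportions
```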

Now imagine we delete the word `Julie` from sentence 1. The new vector for sentence 1 is: `sentence_1=c(2, 0, 0, 2, 0, 1, 1, 1)` *(the 2nd element is now 0)*.

And the new result is... `0.7627701`.

**Conclusion**: Deleting the word `Julie` makes the sentences less similar.

This kind of technique allows us to **order** the data and make decisions quickly.

Airline passengers have many complaints about airlines, and they express their dissatisfaction through the popular Twitter.

In this real case, Jeffrey Breen scrapes data from Twitter and then applies many text/sentiment mining techniques.

Here is the post.

Do you want to start your own project? Just follow this great tutorial by Yanchang Zhao. I'm aware this is not new, but someone new to this topic may benefit from it.

The last links showed how to analyze text considering one word at a time, but what about phrases?

For example, the sentence: **"I don't like to wait in the airport"**.

It's not the same to analyze the correlation between the words:

- "don't"
- "like"
- "wait"

as to analyze the correlation between:

- "don't like"
- "wait"

In the 1st case, the algorithm may show you a correlation between:

- "don't" *and* "like"
- "don't" *and* "wait"
- "like" *and* "wait" -really? ;)

In the 2nd case, the result may be something like:

- "don't like" *and* "wait"

Much clearer, isn't it?

If you want to consider words as phrases -*the 2nd case*-, take a look at this answer from *stackoverflow.com*.

Thanks for reading :)

Shiny Server is a great solution for BI/analytics reporting. It leverages the power of the R language to create interactive reports/dashboards.

Maybe you have tried it but are reluctant to use it in production because it lacks any authentication in its open source version. I've written a post on Auth0's blog that explains how you can use Auth0 to create a nice login page protecting your reports, which can hook up users from other systems (Google? LinkedIn? LDAP? AD? Your custom MySQL database? An Auth0-hosted DB? All apply!).

Click here to read the full article on Auth0's blog

Disclaimer: I'm working as a Data Engineer in Auth0

These systems are used in cross-selling industries, and they measure correlated items as well as their user rate. This last point wasn't included in the apriori algorithm (or association rules) used in market basket analysis.

The link: http://blog.yhathq.com/posts/recommender-system-in-r.html

Here is another nice page showing a simpler case, the one known as association rules - market basket analysis: http://www.rdatamining.com/examples/association-rules

This is not related to R, but it's a really interesting paper about how Amazon works with item-to-item collaborative filtering: http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf

A really challenging objective since they process all the information on-demand to give recommendations in real time.

This is an excellent resource for understanding 2 types of data frame formats: long and wide.

- Just take a look at figure 1 inside the article

1) **Long format**: ggplot2 needs this kind of format to work in certain scenarios (generally grouped plots).

2) **Wide format**: On the other hand, when you read transactional data you will usually find it in long format, and you need it in wide format in order to create a predictive model.

Here, each row represents a *case study*, and each column an attribute/variable: the classical input for building a cluster or predictive model.
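A tiny illustration of the two shapes (the data is made up; base R's `reshape()` is used here instead of reshape2 to keep the example dependency-free):

```r
# Long format: one row per (id, variable) pair
long = data.frame(id       = rep(1:2, each = 2),
                  variable = rep(c("age", "purchases"), times = 2),
                  value    = c(25, 100, 40, 300))

# Wide format: one row per case, one column per variable
wide = reshape(long, idvar = "id", timevar = "variable", direction = "wide")
wide  # two rows (one per id), columns value.age and value.purchases
```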

The most used library to achieve this is **"reshape2"**, and, *what's the difference with "reshape"?*

Package author said:

"Reshape2 is a reboot of the reshape package. It's been over five years since the first release of the package"..."reshape2 uses that knowledge to make a new package for reshaping data that is much more focused and much much faster."

Happy transforming!

"**I want to develop a model that automatically learns over time**" is a really challenging objective. In this post we'll develop a procedure that loads data, builds a model, makes predictions and, if something changes over time, creates a new model, all with **R**.

*Picture credit: S.H Horikawa*

This post intends to recreate, *as simply as possible*, the **machine learning scenario**: the automatic creation of a predictive model with temporal concerns. It's going to be somewhat manual, because the objective is to cover a little of the logic behind a *machine that learns*.

*Start with "Small Data" to conquer Big Data ;)*

We have one input variable (age) and one output variable (purchases). We want to predict next month's purchases based on age. If the model becomes inaccurate, a new one should be built.

In machine learning, it's quite important to understand temporality. We will stand in 3 different dates to introduce this concept:

- 1 Model building (January)
- 2 Model performing OK (February to April)
- 3.1 Model performing badly (May)
- 3.2 New model building (May)

We're in January, and we're building the model with historical data, when we know both variables: age and purchases.

```
## Loading needed libraries
suppressMessages(library(ggplot2))
suppressMessages(library(forecast))
```

Find the data sets used in this example in Github

```
## Reading historical data
set.seed(999)
data_historical=read.delim(file="data_historical.txt", header=T, sep="\t")
```

```
## Plotting current relationship between age and purchases
ggplot(data_historical, aes(x=age, y=purchases)) +
  geom_point(shape=1) +    ## Points as circles (good to see density)
  geom_smooth(method=lm)   ## Linear regression line

## Model creation. Input variable: "age", to predict "purchases".
model=lm(purchases~age, data=data_historical)
```

*You probably know linear regression, but if you don't, check this.*

Clearly the relationship between age and purchases is linear. After building the linear regression model, we check one accuracy metric: **MAPE** (Mean Absolute Percentage Error), *the closer to 0, the better*.

MAPE measures how different the prediction is from the real value, in percentage terms.
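The metric can be computed in one line of base R. A toy example with made-up actual and predicted values (these numbers are illustrative, not from the post's data):

```
## MAPE by hand on made-up actual/predicted values
actual    = c(100, 200, 300)
predicted = c(110, 190, 330)
mape = mean(abs((actual - predicted) / actual)) * 100
round(mape, 2) ## 8.33
```

This is the same formula the `accuracy()` function from the forecast package reports under the "MAPE" column.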

```
## Checking model accuracy
historical_error=round(accuracy(model)[,"MAPE"],2)
historical_error
## Setting up error threshold (to be used later)
threshold=10 ## 10 represents "10%" of error (MAPE)
```

- MAPE in historical data is: 7.97% (the **historical_error** variable).

We expect a similar value over the next months; if not, the model is not a good representation of reality.

**Defining threshold**:

We need to define an error threshold: let's say that if the error (measured by MAPE) in the following months is higher than **10%**, the model has to be rebuilt.

This rebuilding is the key point here: we can automate the process to take new data, build a new model, and, if the new model has an error below the threshold, make it the new model in production (the simplest scenario).
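That decision rule can be expressed as a tiny helper; a minimal sketch (the `needs_rebuild` name is hypothetical, not part of the original code):

```
## Hypothetical helper: decide whether the production model must be replaced
needs_rebuild = function(current_error, threshold) {
  current_error > threshold
}

needs_rebuild(7.97, 10)  ## FALSE: January's error is below the threshold
needs_rebuild(18.79, 10) ## TRUE: an error like May's exceeds it
```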

```
## Checking model coefficients
model$coefficients
```

*R output:*

```
(Intercept)         age
   -15.4992    100.3812
```

In other words, this is what the model looks like:

purchases=100.3812*age - 15.4992
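To make the equation concrete, here it is applied by hand for a hypothetical 30-year-old customer (plain arithmetic, no model object needed):

```
## Applying the fitted equation by hand for a customer aged 30
age = 30
purchases = 100.3812*age - 15.4992
purchases ## 2995.937
```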

During this period new customers arrive, and the model to forecast purchases is applied on the first day of each month. As a matter of fact, we know how the model performed during this 3-month period by looking at the real error (MAPE): predicted purchases vs. real purchases.

*Note: The performance simulation and re-building with R code come in the next step (May).*

Error table shows the following:

As can be seen, there's an **increasing tendency in the error**, getting closer to the maximum allowed.

Now we're on May 31st, and we know how purchases went over the current month. The following procedure should be executed at the end of every month.

```
## Read data from past month, May.
data_may=read.delim(file="data_may.txt", header=T, sep="\t")
## Retrieve the predictions made on May 1st based on the model built on January.
forecasted_purchases=predict.lm(object = model, newdata = data.frame(age=data_may$age))
## Checking error
error_may=accuracy(forecasted_purchases, data_may$purchases)[,"MAPE"]
error_may
## Difference to threshold (10%)
threshold-error_may
```

*R output says:*

"error_may" is 18.79473, and "threshold-error_may" is -8.794733

This month **the error exceeds the threshold** by 8.79%.

This is how the model is working on May:

```
## Further inspection plotting forecasted (blue) against actual (black) purchases.
ggplot(data_may, aes(x=age)) +
  geom_line(aes(y=forecasted_purchases), colour="blue") +
  geom_point(aes(y=purchases), shape=1)
```

Clearly, the model predicts purchases well for customers **under** 35 years old, but becomes **inaccurate for older people**. This segment is buying more than before.

This could be caused, for example, by some change in business policy, a discount that is no longer available, etc.
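The under/over-35 split can be quantified by computing MAPE per age segment. A sketch on simulated data shaped like this May scenario (the data, the cut-off at 35, and the coefficients used to simulate the shift are illustrative assumptions):

```
## Sketch: MAPE by age segment on simulated data shaped like the May scenario
set.seed(999)
age = sample(18:70, 200, replace=TRUE)

## Simulated "true" purchases: customers over 35 now buy more than the January model expects
purchases = ifelse(age < 35, 100*age - 15, 245*age - 4414) + rnorm(200, sd=100)

## Predictions from the January model
predicted = 100.3812*age - 15.4992

mape = function(actual, pred) mean(abs((actual - pred)/actual)) * 100

mape(purchases[age < 35],  predicted[age < 35])   ## low error for younger customers
mape(purchases[age >= 35], predicted[age >= 35])  ## much higher error for older ones
```

Segmenting the error this way points directly at *where* the model broke, not just *that* it broke.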

A new model must be created returning new error metrics.

```
## Procedure to generate a new model
if(error_may > threshold) {
  ## Build new model, based on new data.
  new_model=lm(purchases~age, data=data_may)

  ## Assign predictions to 'May' data (the fitted values on the training data).
  data_may$forecasted_purchases=new_model$fitted.values

  ## Plot: new linear regression
  p=ggplot(data_may, aes(x=age)) +
    geom_line(aes(y=forecasted_purchases), colour="blue") +
    geom_point(aes(y=purchases), shape=1)
  print(p)

  new_error=accuracy(new_model)[,"MAPE"]

  if(new_error < threshold) {
    print("We have a new model built in an automated process! =)")
  } else {
    print("Manual inspection & building =(")
  }
}
```

*R output:*

```
"We have a new model built in an automated process! =)"
```

We have the new model to run next month (June):

```
## Checking new model coefficients
new_model$coefficients
```

*R output:*

```
(Intercept)         age
 -4414.1504    244.8179
-4414.1504 244.8179
```

*In other words...*

purchases=244.8179*age - 4414.1504

And there is the final model!

- When a **variable changes** its distribution, affecting prediction accuracy *significantly*, the model **should be rebuilt** (in our case, when MAPE exceeds 10%).
- Another case is when a **new variable appears**, one we didn't know about when the model was built. The most advanced systems take care of this and automatically map the new concept, like a search engine with new terms.
- The *most* important point here is the concept of a **closed system**: the error is checked every month and determines whether or not the model has to be re-adjusted.
- One step ahead is to use the error to iteratively adapt the model (for example, testing other types of models, with other parameters) until the minimum error is reached. Something similar to an **Artificial Neural Network** model, which measures error iteratively (hundreds or thousands of times...) to reach a proper balance between *generalization* and *particularization*.
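That "one step ahead" idea can be sketched as a loop that fits several candidate formulas and keeps the one with the lowest in-sample MAPE. The candidate formulas and the simulated data below are purely illustrative assumptions, not part of the original post:

```
## Sketch: try several candidate models and keep the one with the lowest MAPE
set.seed(999)
d = data.frame(age=sample(18:70, 200, replace=TRUE))
d$purchases = 100*d$age - 15 + rnorm(200, sd=150)

## Candidate model formulas to compete against each other
candidates = list(
  purchases ~ age,
  purchases ~ poly(age, 2),
  purchases ~ log(age)
)

## In-sample MAPE for a fitted model
mape = function(m) mean(abs((d$purchases - fitted(m)) / d$purchases)) * 100

errors = sapply(candidates, function(f) mape(lm(f, data=d)))
round(errors, 2)

## The formula with the lowest error would become the new production model
best_formula = candidates[[which.min(errors)]]
```

A production version would, of course, evaluate the error on held-out data rather than in-sample, to keep the *generalization*/*particularization* balance mentioned above.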

- LinkedIn group: post questions and/or share something related to data science. Share if you like ;)