# Data Science - Short lesson on cluster analysis

### Introduction

In clustering, you let the data be grouped according to similarity. A cluster model is a set of segments (clusters), each containing cases (such as clients, patients, or cars).

Once a cluster model is developed, one question arises: *How can I describe my model?*

Here we present one way to approach this question: the implementation of a **Coordinate Plot** in **R** *(code available at the end of the post)*.

### Cluster characteristics

In general, a good cluster model has two properties:

- **High similarity** between cases inside each cluster.
- Each cluster should be as **unique** as possible compared with the others.
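
Both properties can be checked with numbers that `kmeans()` already returns. The sketch below is only an illustration: it uses R's built-in `USArrests` dataset as a stand-in (US states, not the country data used in this post), and the seed and `nstart` value are arbitrary choices.

```r
# Stand-in data: built-in USArrests (assumption, not the post's country data).
# Standardize so that no variable dominates the distance calculation.
x <- scale(USArrests)

set.seed(42)  # k-means starts from random centers
fit <- kmeans(x, centers = 3, nstart = 25)

# Similarity inside clusters: total within-cluster sum of squares
# (lower means the cases in each cluster are closer together).
fit$tot.withinss

# Uniqueness of clusters: between-cluster sum of squares
# (higher means the cluster centers are further apart).
fit$betweenss

# Share of the total sum of squares explained by the cluster assignment.
explained <- fit$betweenss / fit$totss
explained
```

A value of `explained` closer to 1 suggests tight, well-separated clusters, although it always increases as more clusters are added.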

We will answer the question of how to describe a model with one example. Each case in the data represents a country, and we built a k-means cluster model with 3 clusters.

*Cluster model illustration with 2 variables and 3 clusters. Circles indicate the center of each cluster.*

### Coordinate plot

This is the graph we use to describe the main characteristics of a cluster model.

#### Coordinate plot characteristics

- Each colored line represents a cluster, plus one extra line that represents **"All"** cases.
- Each cluster has an average for each variable. *The averages are scaled from 0 to 1 so that all variables can be displayed in one plot.*
- For each variable, there will always be one value at 0 and another at 1, because they represent the min and max values.
- The plot should be read vertically, one variable at a time.
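
The characteristics above can be sketched in base R. This is not the code linked at the end of the post, just a minimal version under a few assumptions: the built-in `USArrests` dataset stands in for the country data, each variable is scaled using the min and max of the cluster averages (so one cluster lands on 0 and another on 1), and `plot_coordinates()` is a hypothetical helper name.

```r
# Stand-in data: built-in USArrests (assumption, not the post's country data).
data <- USArrests

set.seed(42)
fit <- kmeans(scale(data), centers = 3, nstart = 25)

# Average of each variable per cluster, on the original scale.
means <- aggregate(data, by = list(cluster = fit$cluster), FUN = mean)
cluster_means <- as.matrix(means[, -1])
rownames(cluster_means) <- paste0("C_", means$cluster)

# Add the "All" row: the average over every case, ignoring clusters.
all_means <- rbind(cluster_means, All = colMeans(data))

# Min-max scale each variable using the cluster averages only.
# (Breaks down if every cluster has the same average for a variable.)
mins <- apply(cluster_means, 2, min)
maxs <- apply(cluster_means, 2, max)
scaled <- sweep(sweep(all_means, 2, mins, "-"), 2, maxs - mins, "/")
round(scaled, 2)

# Hypothetical helper: one line per row of `scaled`, one x position per variable.
plot_coordinates <- function(scaled) {
  n <- nrow(scaled)
  matplot(t(scaled), type = "b", lty = 1, pch = 19, col = seq_len(n),
          xaxt = "n", xlab = "", ylab = "Scaled average (0 to 1)")
  axis(1, at = seq_len(ncol(scaled)), labels = colnames(scaled))
  legend("topright", legend = rownames(scaled), col = seq_len(n),
         lty = 1, pch = 19)
}
```

Calling `plot_coordinates(scaled)` draws the plot. Because "All" is a weighted average of the cluster averages, its line always stays between 0 and 1.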

### How is the scaled average built?

Looking at the "LandArea" variable (measured in square kilometers), we can say that C_2 (cluster 2) has the lowest average land area, followed by C_1. C_3 has the highest value, very far from the other clusters.

In other words, the largest countries are in C_3, while the smallest ones are in C_2.

Below are the original average values (which are not displayed in the plot) and their scaled values:

- 1886206 (C_1) *is converted into:* 0.17
- 243509 (C_2) *is converted into:* 0.00
- 10014500 (C_3) *is converted into:* 1.00

The average for the whole data set *(regardless of cluster segmentation)* is 884633, which is converted into 0.06. That is the "All" line.

**Now we've got our 4 points, for variable land area.**
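
The conversion can be reproduced with min-max scaling, `(value - min) / (max - min)`. The snippet below is a sketch that assumes the min and max are taken over the three cluster averages, and that the values belong to C_1, C_2 and C_3 in that order (inferred from the description above).

```r
# Cluster averages of LandArea (square kilometers), from the values above.
# The cluster labels are an assumption inferred from the text.
land_area <- c(C_1 = 1886206, C_2 = 243509, C_3 = 10014500)
overall   <- c(All = 884633)

lo <- min(land_area)  # becomes 0
hi <- max(land_area)  # becomes 1

# Min-max scaling of the three cluster averages plus the overall average.
scaled <- (c(land_area, overall) - lo) / (hi - lo)
round(scaled, 3)
```

Under these assumptions C_1 comes out at about 0.168, which rounds to the 0.17 shown above. "All" comes out at about 0.066, close to the 0.06 quoted above; the small gap may come from rounding in the quoted averages.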

### Extracting conclusions

**Describing Cluster 3**

C_3 contains the countries with the highest **LandArea** and **Population** (which are not always correlated). Their **Energy** and **LifeExpectancy** values are the highest as well, which could be a sign of a developed country.

However, they have the lowest **BirthRate**. It is well known that some developed countries have a low birth rate.

**Describing Cluster 2**

C_2 is very similar to "All", so there is not much information here: this cluster has averages very close to those of the general population.

**Describing Cluster 1**

C_1 can be seen as the middle point regarding **LandArea**, **Population**, **Energy** and **Rural**.

But it is interesting to note that these countries have the highest **BirthRate** and the lowest **LifeExpectancy**, plus a high **Rural** value (percentage of the population living in rural areas).

This is the opposite of C_3.

Looking at these metrics, we can write the headlines:

- C_3 => Highly developed countries
- C_1 => Less developed countries

### Finally...

- R code: Coordinate plot installation & usage are available on GitHub and here (examples)

Thanks for reading :)