Data Science - Short lesson on cluster analysis

Introduction

In clustering, you let the data group itself according to similarity. A cluster model is a set of segments (clusters), each containing cases such as clients, patients, or cars.

Once a cluster model is developed, one question arises: How can I describe my model?

Here we present one way to approach this question: a coordinate plot implemented in R (code available at the end of the post).

Cluster characteristics

In general, a good cluster model has two properties:

  • High similarity between the cases inside each cluster.
  • Each cluster is as distinct as possible from the others.

We will answer this question with an example. Each case in the data represents a country, and we built a k-means cluster model with 3 clusters.


Cluster model illustration with 2 variables and 3 clusters. Circles indicate the center of each cluster.
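To make the idea of cases, clusters and centers concrete, here is a minimal k-means sketch in plain Python (not the R code from this post). The toy points, the `kmeans` function name and the farthest-first choice of starting centers are all assumptions made for illustration; the post does not say how its model was initialized.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda c: sq_dist(point, centers[c]))

def kmeans(points, k, iterations=20):
    # Assumption: deterministic farthest-first start, so the sketch is repeatable.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iterations):
        # Assignment step: each case goes to its nearest center.
        labels = [nearest(p, centers) for p in points]
        # Update step: each center moves to the mean of its cases.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    labels = [nearest(p, centers) for p in points]
    return centers, labels

# Hypothetical toy data with 2 variables (not the country data used in the post).
points = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9),
          (5.0, 5.1), (5.2, 4.8), (4.9, 5.3),
          (9.0, 1.0), (9.2, 1.3), (8.8, 0.9)]
centers, labels = kmeans(points, k=3)
print(labels)
print(centers)
```

Each entry of `labels` says which cluster a case belongs to, and each entry of `centers` is the mean of the cases in that cluster, which is what the circles in the illustration stand for.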


Coordinate plot

This is the graph we use to describe the main characteristics of a cluster model:


Coordinate plot characteristics

  • Each colored line represents a cluster, plus one extra line that represents "All" cases.
  • Each cluster has an average for each variable. The averages are scaled from 0 to 1 so that all variables can be displayed in one plot.
  • For each variable, there will always be one value at 0 and another at 1, because they represent the minimum and maximum averages.
  • The plot should be read vertically, one variable at a time.
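The averages behind each line can be sketched as follows, in Python rather than the post's R. The function name `cluster_averages`, the toy rows and the cluster labels are hypothetical; only the idea (one mean per variable per cluster, plus an "All" row over every case) comes from the list above.

```python
from collections import defaultdict

def cluster_averages(rows, labels, variables):
    """Mean of each variable per cluster, plus an "All" row over every case."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
        groups["All"].append(row)  # every case also counts toward "All"
    return {
        name: {v: sum(r[v] for r in members) / len(members) for v in variables}
        for name, members in groups.items()
    }

# Hypothetical toy cases, not the country data from the post.
rows = [
    {"LandArea": 100.0, "BirthRate": 30.0},
    {"LandArea": 300.0, "BirthRate": 20.0},
    {"LandArea": 900.0, "BirthRate": 10.0},
    {"LandArea": 700.0, "BirthRate": 12.0},
]
labels = ["C_1", "C_1", "C_2", "C_2"]

averages = cluster_averages(rows, labels, ["LandArea", "BirthRate"])
for name, means in averages.items():
    print(name, means)
```

These are still in the original units of each variable. Rescaling every variable to the 0 to 1 range is a separate step applied to this table before plotting.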

How is the scaled average built?

Looking at the "LandArea" variable (land area in square kilometers), we can say that C_2 (cluster 2) has the lowest average land area, followed by C_1. C_3 has the highest value, very far from the other clusters.

In other words, the largest countries are in C_3, while the smallest ones are in C_2.

Next are the original averages (which are not displayed in the plot) and their scaled values:

  • C_1: 1886206 is converted into 0.17
  • C_2: 243509 is converted into 0.00
  • C_3: 10014500 is converted into 1.00

The average for the whole data set (regardless of cluster) is 884633, which is converted into 0.06. That is the "All" line.

Now we have our 4 points for the LandArea variable.
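The conversion can be reproduced with a min-max rescaling, sketched here in Python rather than the post's R. It assumes the minimum and maximum are taken over the four averages (three clusters plus "All"), and that the three listed values belong to C_1, C_2 and C_3 in that order, which follows from the ranking described above. The function name `min_max_scale` is made up for this example.

```python
def min_max_scale(values):
    """Rescale a dict of numbers so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values.values()), max(values.values())
    return {name: (v - lo) / (hi - lo) for name, v in values.items()}

# LandArea averages quoted in the post (square kilometers).
land_area = {"C_1": 1886206, "C_2": 243509, "C_3": 10014500, "All": 884633}

scaled = min_max_scale(land_area)
for name, value in scaled.items():
    print(name, round(value, 4))
```

C_1, C_2 and C_3 come out at 0.17, 0.00 and 1.00 when rounded to two decimals, and "All" lands between 0.06 and 0.07, close to the cluster with the smallest countries.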

Extracting conclusions

Describing Cluster 3

C_3 contains the countries with the highest LandArea and Population (which are not always correlated). Energy and LifeExpectancy are the highest as well, which could be a sign of a developed country.

However, they have the lowest BirthRate. It is not news that some developed countries have a low BirthRate.

Describing Cluster 2

C_2 is very similar to "All", so there is not much information here: this cluster has averages very close to those of the general population.

Describing Cluster 1

C_1 can be seen as the middle point for LandArea, Population, Energy and Rural.
But it is interesting to note that it has the highest BirthRate and the lowest LifeExpectancy, plus a high Rural value (percentage of the population living in a rural area).
This is the opposite of C_3.

Looking at these metrics, we can write the headlines:

  • C_3 => Highly developed countries
  • C_1 => Less developed countries

Finally...

Thanks for reading :)
