Text Mining Analysis: some theory and practice in R
Introduction
Big Data helps us analyze unstructured data (aka "text") with many techniques. This post presents one of them: cosine similarity.
There is also work by other analysts, who scraped data from Twitter and spotted complaints from airline passengers.
Similarity between two documents
Cosine similarity is a technique to measure how similar two documents are, based on the words they contain.
This link explains the concept very well, with an example that is replicated in R later in this post.
Quick summary: imagine a document as a vector, which you can build just by counting word appearances. Two such vectors form an angle.
If the documents have almost the same words, the cosine of the angle between those vectors will be near 1. If they share few or no words, the score will be close to 0.
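To make the "near 1" and "close to 0" cases concrete, here is a minimal sketch in base R. The helper name `cosine_similarity` and the toy count vectors are my own assumptions for illustration, they do not come from the linked explanation.

```r
# Cosine similarity between two numeric vectors of equal length.
# `cosine_similarity` is a hypothetical helper name, not from any package.
cosine_similarity <- function(a, b) {
  sum(a * b) / sqrt(sum(a^2) * sum(b^2))
}

# Toy word-count vectors over the same vocabulary (made up for illustration).
doc_a <- c(1, 2, 0, 1)
doc_b <- c(1, 2, 0, 1)  # same word counts as doc_a
doc_c <- c(0, 0, 3, 0)  # shares no words with doc_a

same_score <- cosine_similarity(doc_a, doc_b)      # identical counts
disjoint_score <- cosine_similarity(doc_a, doc_c)  # no words in common
print(c(same = same_score, disjoint = disjoint_score))
```

Note that the score is undefined (`NaN`) if one of the vectors contains only zeros, so an empty document would need special handling.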
I replicated the example in R:
1) Julie loves me more than Linda loves me
2) Jane likes me more than Julie loves me
Word counts per sentence:
sentence_1=c(2, 1, 0, 2, 0, 1, 1, 1)
sentence_2=c(2, 1, 1, 1, 1, 0, 1, 1)
crossprod(sentence_1, sentence_2)/sqrt(crossprod(sentence_1) * crossprod(sentence_2))
And the result is... 0.8215838!
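The count vectors above were typed by hand. As a rough sketch, they could also be derived from the raw sentences with base R. The variable names below are hypothetical, and the vocabulary ends up in alphabetical order, which differs from the order used above. Cosine similarity does not depend on the order of the words, as long as both vectors use the same one.

```r
sentences <- c(
  "Julie loves me more than Linda loves me",
  "Jane likes me more than Julie loves me"
)

# Split each sentence into lowercase words (whitespace tokenization only).
tokens <- strsplit(tolower(sentences), "\\s+")

# Shared vocabulary: every distinct word across both sentences.
vocabulary <- sort(unique(unlist(tokens)))

# Count each vocabulary word in one sentence; absent words get 0.
count_words <- function(words) {
  as.vector(table(factor(words, levels = vocabulary)))
}
counts_1 <- count_words(tokens[[1]])
counts_2 <- count_words(tokens[[2]])

# Same formula as above, written with sum() instead of crossprod().
similarity <- sum(counts_1 * counts_2) /
  sqrt(sum(counts_1^2) * sum(counts_2^2))
print(similarity)
```

This should print the same score as the hand-built vectors.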
Now imagine we delete the word Julie from sentence 1. The new vector for sentence 1 is:
sentence_1=c(2, 0, 0, 2, 0, 1, 1, 1)
(2nd element is now 0)
And the new result is... 0.7627701
Conclusion: deleting the word Julie causes the sentences to be less similar.
Techniques of this kind allow us to order the data and make decisions quickly.
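As an illustration of what "ordering the data" could look like, the sketch below ranks a few count vectors by their similarity to sentence 2. The helper `cosine_similarity` is a hypothetical name (defined here so the snippet runs on its own), and the third candidate is a made-up vector added only to have something clearly less similar.

```r
# Hypothetical helper, same formula as in the example above.
cosine_similarity <- function(a, b) {
  sum(a * b) / sqrt(sum(a^2) * sum(b^2))
}

reference <- c(2, 1, 1, 1, 1, 0, 1, 1)  # sentence 2 from the example

candidates <- list(
  sentence_1          = c(2, 1, 0, 2, 0, 1, 1, 1),
  sentence_1_no_julie = c(2, 0, 0, 2, 0, 1, 1, 1),
  made_up_short       = c(0, 0, 1, 0, 1, 0, 0, 0)  # hypothetical extra document
)

# Score every candidate against the reference, then sort from most to least similar.
scores <- sapply(candidates, cosine_similarity, b = reference)
ranking <- sort(scores, decreasing = TRUE)
print(round(ranking, 4))
```

With more documents the idea stays the same: compute one score per document and sort.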
Mining Twitter
Airline passengers often have complaints about airlines, and they express their dissatisfaction on Twitter.
In this real case, Jeffrey Breen scrapes data from Twitter and then applies several text and sentiment mining techniques.
Here is the post.
Do you want to start your own project? Just follow this great tutorial by Yanchang Zhao. I'm aware this is not new, but someone new to the topic may benefit from it.
One step ahead: Analyzing expressions
The previous links showed how to analyze text one word at a time, but what about phrases?
For example, take the sentence: ***"I don't like to wait in the airport"***.
Analyzing the correlation between the words:
- "don't",
- "like",
- "wait"
is not the same as analyzing the correlation between:
- "don't like"
- "wait"
In the 1st case, the algorithm may show you a correlation between:
- "don't" and "like"
- "don't" and "wait"
- "like" and "wait" -really? ;)
In the 2nd case, the result may be something like:
- "don't like" and "wait"
Much clearer, isn't it?
If you want to treat groups of words as phrases (the 2nd case), take a look at this answer from stackoverflow.com.
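To give a feel for the difference between single words and phrases, here is a minimal base R sketch that builds two-word phrases (bigrams) from the example sentence. This is my own simplified illustration with hypothetical variable names, and it is not necessarily the approach used in the linked answer.

```r
sentence <- "I don't like to wait in the airport"

# Single words (unigrams): lowercase, split on whitespace.
words <- strsplit(tolower(sentence), "\\s+")[[1]]

# Two-word phrases (bigrams): pair each word with the one that follows it.
bigrams <- paste(head(words, -1), tail(words, -1))

print(words)
print(bigrams)
```

The bigram list contains "don't like" as a single item, so it can be counted and compared as one unit instead of two separate words.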
Thanks for reading :)