Much of the data being generated today is unstructured and text-heavy, so turning unstructured text into structured data is crucial.

For my Text Analytics and Natural Language Processing project, I had to pick an area of interest, then collect and analyze text data in RStudio using two different frameworks.

Hence, I decided to analyze text data about my favorite soccer team, A.C. Milan. I used the twitteR package to find out what fans from around the world think about the team.

First, I downloaded the 1,000 most recent tweets in English and stored them in a data frame. Then, I tokenized the text, removed stop words, counted the most frequent words, and plotted them in a histogram.
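The tokenize, filter, and count steps can be sketched roughly as follows. This is only an illustration in plain Python, not the R code used for the project: the sample tweets, the tiny stop-word list, and the function names are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical sample tweets (the project itself used about 1,000 real tweets).
tweets = [
    "Milan win again! Great match against Genoa",
    "What a match, Milan are top of Serie A",
    "Ibrahimovic scores and Milan win",
]

# Tiny illustrative stop-word list (an assumption, not a standard lexicon).
STOP_WORDS = {"a", "an", "and", "the", "of", "are", "what", "again", "against"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(texts, stop_words):
    """Count every token that is not a stop word across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(t for t in tokenize(text) if t not in stop_words)
    return counts


counts = word_counts(tweets, STOP_WORDS)
# The most frequent words are what would feed the histogram.
print(counts.most_common(5))
```

In a real run, the stop-word list would come from a standard lexicon rather than a hand-written set.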

 

Picture1
Picture2

From the list and the graph above, we can get some insight into what is currently going on with A.C. Milan:

· The team is currently competing for the title in Serie A, Italy's top soccer league.

· The most recent matches were against Sassuolo, Genoa, and Salernitana.

· The players attracting the most rumors are Zlatan Ibrahimovic, Kjaer, and Messias.

 

 

#SENTIMENT ANALYSIS

The first framework I used is sentiment analysis. After comparing the available lexicons, I chose the Bing lexicon to find the most common positive and negative words used about the team.
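As a rough sketch of the idea, lexicon-based sentiment counting keeps only the words that appear in the lexicon and tallies them by label. The example below is plain Python rather than the R code from the project. The mini-lexicon is a hypothetical stand-in with the same word-to-label shape as the Bing lexicon, and the tweets are invented.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: word -> "positive" / "negative".
# A stand-in for the Bing lexicon, which is far larger.
LEXICON = {
    "win": "positive",
    "great": "positive",
    "happy": "positive",
    "injury": "negative",
    "loss": "negative",
}

# Invented sample tweets for illustration only.
tweets = [
    "Great win for Milan, fans are happy",
    "Another win! Great night",
    "Terrible injury, a big loss for the team",
]


def sentiment_word_counts(texts, lexicon):
    """Count (word, sentiment) pairs for every token found in the lexicon."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in lexicon:
                counts[(token, lexicon[token])] += 1
    return counts


sentiment_counts = sentiment_word_counts(tweets, LEXICON)

# Aggregate per label to compare positive and negative contributions.
totals = Counter()
for (word, label), n in sentiment_counts.items():
    totals[label] += n

print(sentiment_counts.most_common())
print(dict(totals))
```

Words missing from the lexicon (here, "terrible") are simply ignored, which is one reason the choice of lexicon matters.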

Picture3

From the graph above, we can compare the positive and negative words used about the team. Positive words outnumber negative ones, and the top 10 positive words contribute far more to overall sentiment. From the words themselves, we can deduce that A.C. Milan is currently at the top of the table, winning and defending the title, and that fans and players are happy. On the other hand, one of our best defenders just suffered a major knee injury and will be off the field for the next 8 months, a big loss for the team.

#N-GRAMS ANALYSIS

The second framework I used is n-gram analysis. I wanted to find the most common bigrams, that is, pairs of words that appear next to each other.
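A minimal sketch of bigram counting, again in plain Python rather than the R code from the project: each tweet is split into tokens, adjacent tokens are paired, and any pair containing a stop word is dropped. The tweets, the stop-word list, and the function name are assumptions made up for the example.

```python
import re
from collections import Counter

# Invented sample tweets for illustration only.
tweets = [
    "Stefano Pioli extends his contract",
    "Stefano Pioli on the win against Lazio",
    "Champions League night for Stefano Pioli",
]

# Tiny illustrative stop-word list (an assumption, not a standard lexicon).
STOP_WORDS = {"his", "on", "the", "against", "for"}


def bigrams(text, stop_words):
    """Return adjacent word pairs, skipping pairs that contain a stop word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pairs = zip(tokens, tokens[1:])
    return [(a, b) for a, b in pairs
            if a not in stop_words and b not in stop_words]


bigram_counts = Counter()
for tweet in tweets:
    bigram_counts.update(bigrams(tweet, STOP_WORDS))

# The most frequent pairs become the edges of a bigram network graph.
print(bigram_counts.most_common(3))
```

Pairing words before filtering keeps only words that were truly adjacent in the tweet. Filtering first would create artificial pairs across the removed words.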

 

Picture4

 

From the bigram graph above, we can gain further useful insights. The most densely connected cluster covers the recent games that A.C. Milan won against Lazio, Roma, Salernitana, Genoa, Atalanta, and others. From the other clusters, we can see that the coach, Stefano Pioli, just extended his contract with the club, and that there is a crucial match against Liverpool this coming week to continue our UEFA Champions League campaign.

 

#CONCLUSION 

Overall, from the text data collected and from comparing the two frameworks I chose for my analysis, we can deduce that A.C. Milan is in a very positive moment and needs to stay on track to win the championship.

#sempreforzaMilan!

 

 
