Working out how to analyze text data is one of the most difficult problems in data analysis. To get your head around how difficult it is, look at the table below, which shows the answers of 20 people to the question “What do you dislike about Tom Cruise?”
| Nothing I really like him. | nothing |
| he stuck on him self | He’s good actor, but hes a real jerk |
| he is gay | his ego |
| Nothing | his scientology beliefs |
| He doesn’t seem very genuine. | nothing |
| He thinks is too cool | not sure |
| i hate everything about him. he sucks as an actor | everything |
| I think he is a great actor but not a very good husband | his apparant (whether real or not) arrogance; Scienetology |
| I dont like that he acts like he is better than everyone else. | everything |
These 20 answers are from a survey of 300 people. Most people would not have the patience to read through all 300, yet 300 responses are a drop in the bucket compared with the size of some text databases. How would you go about summarizing all of the comments about Tom Cruise in social media?
The simplest solution to this problem is to count how frequently each word appears. Because there are commonly hundreds or even thousands of different words, this leads to a different problem: how to create an easy-to-read table or chart that shows all of this data. The only good solution to this problem is the word cloud (also known as a tag cloud). There are many free word cloud services out there and, at the moment, these are the only useful text analytics tools that can be obtained for free.
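The word-counting step can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular word cloud tool: the function name `word_frequencies` and the tokenizing rule (lowercase, keep only runs of letters and apostrophes) are assumptions made for the example, and the sample uses four of the responses from the table above.

```python
import re
from collections import Counter

def word_frequencies(responses):
    """Count how often each word appears across all responses (case-insensitive)."""
    counts = Counter()
    for response in responses:
        # Assumed tokenizing rule: lowercase the text and keep runs of
        # letters and apostrophes (straight or curly) as words.
        words = re.findall(r"[a-z'’]+", response.lower())
        counts.update(words)
    return counts

# Four of the survey responses shown in the table.
sample = [
    "Nothing I really like him.",
    "his ego",
    "his scientology beliefs",
    "nothing",
]

freqs = word_frequencies(sample)
for word, count in freqs.most_common():
    print(word, count)
```

Even on four responses the output is a list of nine distinct words, which hints at why a plain frequency table becomes hard to read once there are hundreds of them.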
The most famous of all the free word cloud tools is Wordle. Despite being one of the oldest of the tools, it remains one of the best, giving the user a good level of control over things like colors and fonts. Wordle’s word cloud for the 300 people’s reasons for disliking Tom Cruise is shown below.
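The core idea of a word cloud is that more frequent words are drawn larger. The sketch below shows one simple way to turn counts into font sizes, using linear scaling between a minimum and maximum size. It is an assumption for illustration only: how Wordle actually scales words is not described here, the function `font_sizes` and its parameters are hypothetical, and the counts are made up rather than taken from the survey.

```python
def font_sizes(counts, min_size=10, max_size=72):
    """Map word counts to font sizes by linear interpolation (an assumed scheme)."""
    if not counts:
        return {}
    lowest, highest = min(counts.values()), max(counts.values())
    span = highest - lowest
    sizes = {}
    for word, n in counts.items():
        # If every word has the same count, place them all at the midpoint size.
        fraction = (n - lowest) / span if span else 0.5
        sizes[word] = round(min_size + fraction * (max_size - min_size))
    return sizes

# Made-up counts, for illustration only.
sizes = font_sizes({"nothing": 40, "ego": 10, "arrogant": 25})
print(sizes)
```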
A practical problem with this word cloud is that it is filled with a lot of irrelevant information. The most prominent word in the cloud is Nothing, which is arguably not even an answer to the question. Other prominent words are things like Tom and Cruise, which people understandably typed but which do not help in understanding what it is that people dislike about Tom Cruise. And if you look really carefully, you will see that there are lots of words relating to religion and scientology, but because different words have been used, none of them stands out.
DataCracker solves some of these practical problems with word clouds. It permits the user to exclude words (such as Nothing) and to manually combine words that are similar, be that in a literal sense (e.g., ‘scientology’ and ‘scientologist’) or in terms of their meaning in the context of the data being analyzed (‘scientology’ and ‘religion’). The resulting word cloud, shown below, gives a much clearer answer to the question of why people dislike Tom Cruise.
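The two clean-up steps described above, excluding words and combining similar ones, can be sketched as follows. This is not DataCracker's implementation; the names `EXCLUDE`, `MERGE` and `cleaned_frequencies` are hypothetical, the word lists are examples an analyst would choose by hand, and the last sample response is invented to show a meaning-based merge.

```python
import re
from collections import Counter

# Hypothetical, hand-picked lists for illustration.
EXCLUDE = {"nothing", "tom", "cruise"}
MERGE = {
    "scientologist": "scientology",  # literal similarity
    "scienetology": "scientology",   # misspelling seen in the responses
    "religion": "scientology",       # similar meaning in this context
}

def cleaned_frequencies(responses, exclude=EXCLUDE, merge=MERGE):
    """Count words after merging similar words and dropping excluded ones."""
    counts = Counter()
    for response in responses:
        for word in re.findall(r"[a-z'’]+", response.lower()):
            word = merge.get(word, word)  # map variants to one canonical word
            if word in exclude:
                continue                  # skip uninformative words
            counts[word] += 1
    return counts

sample = [
    "nothing",
    "his scientology beliefs",
    "his apparant (whether real or not) arrogance; Scienetology",
    "Tom Cruise pushes his religion",  # invented response, for illustration
]

cleaned = cleaned_frequencies(sample)
print(cleaned.most_common(3))
```

Merging is applied before exclusion so that a variant mapped onto an excluded word is also dropped. With the lists above, the three differently worded mentions are all counted under scientology.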
An alternative improvement to word clouds is to give users additional control over the appearance of the word cloud. The word cloud below has been created by Tagxedo, and uses a photo of Tom Cruise as a template for the word cloud.