Twitter is an online social networking and microblogging service that enables its users to send and read text-based messages of up to 140 characters, known as "tweets". It can be seen as an instantaneous public social impulse: it reflects public opinion on a much shorter time scale than news sites. Tweets are short, written in informal language, full of colloquialisms and abbreviations, and show a strong tendency toward sarcasm. The language used on Twitter differs from that of traditional texts and requires specific preprocessing steps, such as detection of emoticons, removal of URLs, and expansion of abbreviations.
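The three preprocessing steps named above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the regular expressions, the tiny `ABBREVIATIONS` table, and the function name `preprocess` are all assumptions chosen for the example.

```python
import re

# Assumed patterns: deliberately small, real emoticon and URL inventories are larger.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOTICON_RE = re.compile(r"[:;=][-']?[)(\]\[DPp/\\|]")

# Hypothetical abbreviation table; a real one would be much larger and language-specific.
ABBREVIATIONS = {"u": "you", "gr8": "great", "thx": "thanks", "imo": "in my opinion"}


def preprocess(tweet):
    """Return (cleaned_text, emoticons) for a single tweet."""
    # Remove URLs first, so that the ":/" in "http://" is not mistaken for an emoticon.
    text = URL_RE.sub(" ", tweet)
    # Record emoticons before stripping them; they carry sentiment information.
    emoticons = EMOTICON_RE.findall(text)
    text = EMOTICON_RE.sub(" ", text)
    # Expand abbreviations token by token (case-insensitive lookup).
    tokens = [ABBREVIATIONS.get(tok.lower(), tok) for tok in text.split()]
    return " ".join(tokens), emoticons


print(preprocess("thx for the gr8 talk :) http://t.co/abc"))
# prints ('thanks for the great talk', [':)'])
```

The order of the steps matters in this sketch: detecting emoticons before removing URLs would produce false matches inside the URL scheme.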
We are monitoring Twitter and analyzing the volume, trends and sentiment of tweets about relevant socio-economic objects in order to enrich the properties of the existing networks.
We are implementing a real-time adaptive hierarchical clustering algorithm to identify main topics in a stream of tweets, to detect emerging and diminishing topics, and to track the evolution of specific topics through time. Twitter topic tracking will be used to detect emerging trends in a domain.
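To make the idea of grouping a tweet stream into topics concrete, here is a much simpler stand-in: a flat, single-pass ("leader") clustering with cosine similarity over term counts. It is not the adaptive hierarchical algorithm described above; the class name `StreamClusterer`, the whitespace tokenization, and the similarity threshold are assumptions made for illustration only.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class StreamClusterer:
    """Single-pass clustering: each tweet joins the most similar topic or starts a new one."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed value; would need tuning on real data
        self.centroids = []         # summed term counts per topic
        self.sizes = []             # number of tweets assigned to each topic

    def add(self, tweet):
        """Assign one tweet to a topic and return the topic index."""
        vec = Counter(tweet.lower().split())
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < self.threshold:
            # No sufficiently similar topic exists yet: open a new one.
            self.centroids.append(vec)
            self.sizes.append(1)
            return len(self.centroids) - 1
        # Fold the tweet into the existing topic (cosine is scale-invariant,
        # so summed counts serve as the centroid).
        self.centroids[best].update(vec)
        self.sizes[best] += 1
        return best
```

In a sketch like this, emerging and diminishing topics could be approximated by comparing how fast each entry of `sizes` grows between consecutive time windows; a hierarchical variant would additionally merge or split topics as the stream evolves.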
For sentiment classification, a number of features have to be generated from tweets, such as bag-of-words, n-grams, part-of-speech tags, and Twitter-specific features: hashtags, punctuation and character repetitions, and capitalized words. The language of each tweet has to be detected first, and then the language-specific components are invoked. We are developing sentiment classifiers that label tweet polarity as “positive”, “negative” or “neutral” using a specialized SVM (with a margin around the hyperplane within which tweets are assigned the “neutral” label). We start by training language-independent classifiers, using emoticons as approximate labels. In the next stage, we will enrich the training features with language-specific, sentiment-bearing lexicons.
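The sketch below illustrates three of the ingredients mentioned above: emoticon-based approximate labels, a few of the listed features, and a margin rule for the "neutral" class. It makes several assumptions: the emoticon sets, the feature names, the reading of "capitalized words" as all-caps tokens, and the margin value are all hypothetical, and `score` stands for the signed decision value of an already trained linear SVM, which is not trained here. Part-of-speech tags are omitted because they require a language-specific tagger.

```python
import re

# Assumed emoticon inventories; real ones would be larger.
POSITIVE_EMOTICONS = {":)", ":-)", ":D", "=)"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":'("}


def emoticon_label(tweet):
    """Approximate (noisy) training label; None if emoticons are absent or mixed."""
    tokens = tweet.split()
    pos = any(t in POSITIVE_EMOTICONS for t in tokens)
    neg = any(t in NEGATIVE_EMOTICONS for t in tokens)
    if pos == neg:
        return None
    return "positive" if pos else "negative"


def extract_features(tweet, n=2):
    """Bag-of-words, word n-grams, and a few Twitter-specific counts."""
    tokens = tweet.split()
    lowered = [t.lower() for t in tokens]
    features = {}
    for tok in lowered:  # bag-of-words
        features[tok] = features.get(tok, 0) + 1
    for i in range(len(lowered) - n + 1):  # word n-grams
        gram = " ".join(lowered[i:i + n])
        features[gram] = features.get(gram, 0) + 1
    features["__hashtags__"] = sum(t.startswith("#") and len(t) > 1 for t in tokens)
    features["__punct_repeats__"] = len(re.findall(r"[!?.]{2,}", tweet))
    features["__char_repeats__"] = len(re.findall(r"(\w)\1{2,}", tweet))
    features["__all_caps__"] = sum(t.isupper() and len(t) > 1 for t in tokens)
    return features


def polarity(score, margin=1.0):
    """Map an SVM decision value to a label; values inside the margin are neutral."""
    if score > margin:
        return "positive"
    if score < -margin:
        return "negative"
    return "neutral"
```

For example, under these assumptions `polarity(0.3)` yields "neutral" while `polarity(1.7)` yields "positive". Widening `margin` trades coverage of the polar classes for fewer low-confidence polar labels.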