Mining Streams of News and Blogs

News, blogs, and web forums are a continuous source of valuable unstructured information. The aim of this infrastructure is to

  • Monitor the relevant sources,
  • Acquire, clean and process the unstructured data,
  • Extract information and knowledge in the form of networks.

Acquiring Unstructured Data

We continuously retrieve unstructured data sources (i.e., streams of textual documents) from publicly available big-data web sources. The data acquisition pipeline, adopted from the European project FIRST and adapted to the specific needs of SIMPOL, acquires unstructured data from several sources and prepares it for analysis. Pre-processing steps include removing

  • boilerplate,
  • duplicates,
  • spam,
  • unsupported languages,
  • off-topic documents,
  • other irrelevant content.
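The filtering steps above can be sketched as a small generator that chains the individual filters. This is a minimal illustration, not the pipeline's actual rules: the boilerplate heuristic, the spam patterns, and exact-hash deduplication are all simplifying assumptions (a production pipeline would use trained classifiers and near-duplicate detection).

```python
import hashlib
import re

def strip_boilerplate(text):
    # Illustrative heuristic: drop very short lines, which are often
    # navigation menus, footers, or other boilerplate.
    lines = [ln.strip() for ln in text.splitlines()]
    return " ".join(ln for ln in lines if len(ln.split()) >= 5)

class DuplicateFilter:
    # Exact-duplicate detection via content hashing; a stand-in for the
    # near-duplicate detection a real pipeline would need.
    def __init__(self):
        self.seen = set()

    def is_duplicate(self, text):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in self.seen:
            return True
        self.seen.add(digest)
        return False

# Hypothetical spam markers, purely for illustration.
SPAM_PATTERNS = re.compile(r"(buy now|click here|free money)", re.IGNORECASE)

def preprocess(doc_stream):
    # Yields cleaned documents, skipping duplicates and spam.
    dedup = DuplicateFilter()
    for text in doc_stream:
        clean = strip_boilerplate(text)
        if not clean or dedup.is_duplicate(clean) or SPAM_PATTERNS.search(clean):
            continue
        yield clean
```

Because each filter is a plain function over the stream, further filters (language detection, topic filtering) can be slotted in the same way.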

The data acquisition pipeline has been running continuously since October 2011, polling the Web and relevant public APIs for new content and turning it into a stream of pre-processed text documents.
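The polling loop can be sketched as a generator that repeatedly calls a source and yields only previously unseen items, turning polling into a document stream. The `fetch_batch` callable and `(id, text)` pair shape are assumptions for illustration; the actual pipeline's connectors and scheduling are more involved.

```python
import time

def poll_stream(fetch_batch, interval_seconds=60, max_polls=None):
    # Repeatedly poll a source (an RSS feed, a public API, ...) and yield
    # only documents not seen before. `fetch_batch` is a hypothetical
    # callable returning an iterable of (doc_id, text) pairs.
    seen_ids = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for doc_id, text in fetch_batch():
            if doc_id not in seen_ids:
                seen_ids.add(doc_id)
                yield doc_id, text
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval_seconds)
```

Downstream consumers see a single uninterrupted stream, regardless of how often each source is polled.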

Network Extraction from Streams of Text

Extracting networks from text involves structuring textual information, i.e., annotating the unstructured text with objects and with links between those objects. From the document stream, we will identify socio-economic objects and discover relationships between them (e.g., “owns”, “supports”, “opposes”, “does business with”). This will allow us to extend existing (explicit) networks with additional relations that were not initially available. To achieve this goal, we will draw on techniques also used in ontology learning: natural-language-processing techniques such as subject-predicate-object triple extraction, and machine-learning techniques such as relation classification and context clustering.
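The idea of triple extraction feeding a network can be illustrated with a deliberately crude sketch: a fixed relation lexicon and a pattern matching capitalized phrases around a known relation verb. The lexicon, the regular expression, and the capitalization heuristic are all illustrative assumptions; the actual approach relies on dependency parsing, relation classification, and context clustering rather than patterns like this.

```python
import re
from collections import defaultdict

# Hypothetical lexicon mapping surface verbs to relation labels.
RELATION_VERBS = {"owns": "owns", "supports": "supports", "opposes": "opposes"}

# Capitalized phrase, relation verb, capitalized phrase.
TRIPLE_RE = re.compile(
    r"(?P<subj>[A-Z]\w*(?:\s[A-Z]\w*)*)\s+"
    r"(?P<pred>owns|supports|opposes)\s+"
    r"(?P<obj>[A-Z]\w*(?:\s[A-Z]\w*)*)"
)

def extract_triples(text):
    # Rough subject-predicate-object extraction from one document.
    for m in TRIPLE_RE.finditer(text):
        yield m.group("subj"), RELATION_VERBS[m.group("pred")], m.group("obj")

def build_network(docs):
    # Aggregate triples from a document stream into labeled edge sets:
    # relation label -> set of (subject, object) pairs.
    edges = defaultdict(set)
    for text in docs:
        for subj, pred, obj in extract_triples(text):
            edges[pred].add((subj, obj))
    return edges
```

Each relation label becomes an edge type, so the extracted pairs can directly extend an existing explicit network with new, text-derived relations.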