Extracting Networks from Open Data

Open data is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. We have designed a general-purpose workflow that takes any instance of Linked Open Data (LOD) in the standard RDF formalism and transforms it into a Neo4j graph database. The workflow and its implementation are open source and freely available on GitHub (a widely used hosting service for open source software). We have (semi-)automatically extracted networks from the following sources and inserted them into the Neo4j database:

  • DBpedia – structured subset of Wikipedia
  • reegle – renewable energy resources
  • European Parliament – members and political groups
  • Climate change sceptics – U.S. climate change counter-movement organizations

The sources represent various types of data of interest to the Simpol project. DBpedia is a very large, general-purpose, structured LOD version of Wikipedia. reegle is also in the LOD format, but specializes in renewable energy resources and related organizations. Data about the European Parliament is relevant for networks of influence, but is not in the standard LOD format. The climate sceptics data is likewise relevant for networks of influence, and was extracted from the supplementary material of an article.

Network Extraction from Linked Open Data Workflow

We have developed a generic workflow that transforms LOD into a graph database. The workflow takes LOD in the RDF format as input, extracts the network in a semi-automated, user-supervised fashion, and stores it in a Neo4j graph database. The workflow is depicted in Figure 1. The project source is available in the public GitHub repository Lod2neo4j.

Figure 1: Network extraction from Linked Open Data workflow

Insert Labels

The LOD data usually comes in two parts: the ontology and the data itself. We use the ontology as metadata and insert into Neo4j the RDF types as possible labels for nodes.

The workflow generally starts from an empty Neo4j database. This speeds up the insert procedure, because there is no need to check for and resolve data conflicts (violations of the unique URI constraint). Starting from an empty database also lets the workflow control the order in which node labels are inserted, which affects how the visualization can be customized in the Neo4j shell. The “Insert Labels” procedure is optional, but recommended if the order of labels is important to the user.
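As a rough illustration of the idea, the sketch below collects candidate node labels from an ontology in the order in which they first appear. It is not the Lod2neo4j implementation: it assumes the ontology is available as N-Triples lines, that classes are declared with owl:Class, and that the label is simply the local name of the class URI. The function names and the example.org URIs are made up for the example.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"

# Matches N-Triples lines whose subject, predicate and object are all URIs.
URI_TRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$")


def local_name(uri):
    """Return the part of a URI after the last '#' or '/'."""
    return re.split(r"[#/]", uri.rstrip("#/"))[-1]


def labels_from_ontology(lines):
    """Collect declared class names, keeping the order of first appearance."""
    labels = []
    for line in lines:
        match = URI_TRIPLE.match(line.strip())
        if not match:
            continue  # literals, comments and blank lines are ignored
        subject, predicate, obj = match.groups()
        if predicate == RDF_TYPE and obj == OWL_CLASS:
            label = local_name(subject)
            if label not in labels:
                labels.append(label)
    return labels


# Hypothetical ontology fragment, for illustration only.
ontology = [
    f"<http://example.org/ontology/Organisation> <{RDF_TYPE}> <{OWL_CLASS}> .",
    f"<http://example.org/ontology/Person> <{RDF_TYPE}> <{OWL_CLASS}> .",
    f"<http://example.org/ontology/Organisation> <{RDF_TYPE}> <{OWL_CLASS}> .",
]
labels = labels_from_ontology(ontology)
print(labels)
```

Because the list preserves first-appearance order, reordering the ontology input is one simple way to influence the order in which labels would be created.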

LOD inspect

The procedure “LOD inspect” analyses the RDF data and generates a file named “xx-lod2neo.n3”. This automatically generated file describes all the predicates in the RDF data, categorizes them as “relations” or “node properties” in Neo4j, and names them according to the naming convention described in the previous section. The script also generates a report with examples for each relation and property type. The file “xx-lod2neo.n3” can be checked and modified manually: by changing relationships to properties and vice versa, by renaming relationships or properties, and by removing those the user does not want to include in the Neo4j graph.
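The categorization step can be sketched as follows. The rule used here is an assumption made for illustration, not necessarily the one the script applies: a predicate whose objects are all URIs is proposed as a relation, and a predicate with any literal object is proposed as a node property. The function name, the report layout and the example.org data are hypothetical, and the mapping file format itself is not reproduced.

```python
import re

# Subject and predicate are URIs; the object is either a URI or a quoted
# literal, optionally followed by a language tag or datatype.
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+(<[^>]+>|".*"\S*)\s*\.\s*$')


def inspect(lines):
    """Summarise each predicate: proposed kind, triple count, one example."""
    report = {}
    for line in lines:
        match = TRIPLE.match(line.strip())
        if not match:
            continue  # blank nodes, comments and malformed lines are skipped
        subject, predicate, obj = match.groups()
        kind = "relation" if obj.startswith("<") else "property"
        entry = report.setdefault(
            predicate, {"kind": kind, "count": 0, "example": (subject, obj)}
        )
        entry["count"] += 1
        if kind == "property":
            # A single literal value is enough to propose a node property.
            entry["kind"] = "property"
    return report


# Made-up triples, for illustration only.
data = [
    '<http://example.org/OrgA> <http://example.org/name> "Organisation A"@en .',
    "<http://example.org/OrgA> <http://example.org/partnerOf> <http://example.org/OrgB> .",
    "<http://example.org/OrgB> <http://example.org/partnerOf> <http://example.org/OrgC> .",
]
report = inspect(data)
for predicate, entry in report.items():
    print(predicate, entry["kind"], entry["count"], entry["example"])
```

Keeping one example per predicate mirrors the report described above and gives the reviewer something concrete to look at before changing a proposed category.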

LOD insert

Two different procedures have been implemented to insert the linked open data into Neo4j. The first is general-purpose: it checks for data conflicts (e.g., violations of the unique URI constraint) and either resolves them or issues warnings. The second assumes the database is empty and only ensures consistency within the inserted data. The second procedure is significantly faster than the general-purpose one, which is especially important when inserting large sets of linked open data such as DBpedia.

Both procedures take LOD files in common formats such as nt, n3 and RDF, together with the (reviewed) mapping file “xx-lod2neo.n3” generated by LOD_inspect. The data is then loaded into main memory and systematically inserted into Neo4j. The insert procedures use transactions, retry on failure and other features that make them reliable even for large datasets.
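The combination of batching and retry on failure can be sketched generically as below. This is a minimal illustration under stated assumptions, not the project's code: run_batch stands for whatever callable writes one batch inside a single database transaction, and the batch size, retry count and the FlakyStore test double are all hypothetical.

```python
import time


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def insert_in_batches(items, run_batch, batch_size=1000, max_retries=3, delay=0.0):
    """Send items batch by batch, retrying a failed batch up to max_retries times."""
    inserted = 0
    for batch in batches(items, batch_size):
        for attempt in range(1, max_retries + 1):
            try:
                run_batch(batch)  # assumed to be one transaction
                inserted += len(batch)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up after the last attempt
                time.sleep(delay * attempt)  # simple linear back-off
    return inserted


class FlakyStore:
    """Stand-in for a database: the first write fails, later writes succeed."""

    def __init__(self):
        self.rows = []
        self.calls = 0

    def write(self, batch):
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("simulated transient failure")
        self.rows.extend(batch)


store = FlakyStore()
count = insert_in_batches(list(range(5)), store.write, batch_size=2)
print(count, store.calls)  # prints "5 4"
```

Retrying a whole batch is only safe if a failed transaction leaves nothing behind, which is why the sketch treats each batch as a single all-or-nothing unit.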

The Linked Open Data Network

We have used the workflow presented in Figure 1 to insert the data from the following sources into the Simpol knowledge base: DBpedia and reegle. The inserted data includes 3.7 million nodes and 12.7 million relations. The insert procedure takes approximately 3 hours when starting from an empty database. An example of the extracted graph that merges data from both DBpedia and reegle is depicted in Figure 2.

Figure 2: Network extracted from Linked Open Data: Data about Greenpeace

Network Extraction from (not Linked) Open Data

While DBpedia and reegle are 5-star linked open data, other sources of open data are relevant to the Simpol project but do not adhere to the 5-star LOD standard. One example is the European Parliament website, where the list of members of the European Parliament, with their political groups, national political groups and countries, is available in XML format. This is 3-star open data: it is free, machine readable and in an open format.
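Turning such an XML list into a network can be sketched as follows. The XML element names, node labels and relation names below are invented for the example and do not reflect the actual schema of the European Parliament data or the mapping used for the Simpol knowledge base.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout, for illustration only.
SAMPLE = """<meps>
  <mep>
    <fullName>Member One</fullName>
    <country>Country A</country>
    <politicalGroup>Group X</politicalGroup>
    <nationalPoliticalGroup>Party P</nationalPoliticalGroup>
  </mep>
  <mep>
    <fullName>Member Two</fullName>
    <country>Country B</country>
    <politicalGroup>Group X</politicalGroup>
    <nationalPoliticalGroup>Party Q</nationalPoliticalGroup>
  </mep>
</meps>"""

# (XML element, node label, relation name); all names are assumptions.
FIELDS = [
    ("country", "Country", "REPRESENTS"),
    ("politicalGroup", "PoliticalGroup", "MEMBER_OF"),
    ("nationalPoliticalGroup", "NationalParty", "MEMBER_OF_NATIONAL"),
]


def extract_network(xml_text):
    """Return (nodes, edges): nodes as (label, name), edges as (member, relation, target)."""
    root = ET.fromstring(xml_text)
    nodes = set()
    edges = []
    for mep in root.iter("mep"):
        name = mep.findtext("fullName")
        nodes.add(("Member", name))
        for tag, label, relation in FIELDS:
            value = mep.findtext(tag)
            if value:
                nodes.add((label, value))  # the set merges shared groups
                edges.append((name, relation, value))
    return nodes, edges


nodes, edges = extract_network(SAMPLE)
print(len(nodes), len(edges))
```

Storing nodes in a set means that two members of the same political group end up linked to a single shared group node, which is what makes the result a network rather than a list of records.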

The European Parliament data was imported into the Simpol knowledge base. The Simpol portal's GraphML export option was then used, and the result was visualized in Gephi, an interactive visualization and exploration platform for networks. The resulting graph is depicted in Figure 3.
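For readers unfamiliar with GraphML, the sketch below writes a small graph in that format using only the Python standard library. It illustrates the file format that Gephi can open and is not the Simpol portal's export code; the function name, the node identifiers and the single "label" attribute are assumptions.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def to_graphml(nodes, edges):
    """Serialise a graph. nodes: dict id -> label; edges: list of (source_id, target_id)."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare a string attribute named "label" that nodes may carry.
    ET.SubElement(
        root,
        "key",
        {"id": "label", "for": "node", "attr.name": "label", "attr.type": "string"},
    )
    graph = ET.SubElement(root, "graph", edgedefault="directed")
    for node_id, label in nodes.items():
        node = ET.SubElement(graph, "node", id=node_id)
        ET.SubElement(node, "data", key="label").text = label
    for source, target in edges:
        ET.SubElement(graph, "edge", source=source, target=target)
    return ET.tostring(root, encoding="unicode")


# Made-up graph, for illustration only.
graphml_text = to_graphml(
    {"n0": "Member One", "n1": "Group X"},
    [("n0", "n1")],
)
print(graphml_text)
```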

Figure 3: Gephi visualization of the European parliament