Distributed extraction of Wikipedia data dumps for DBpedia
by Nilesh Chakraborty for DBpedia & DBpedia Spotlight
The DBpedia project “extracts structured, multilingual knowledge from Wikipedia and makes it freely available on the Web using Semantic Web and Linked Data technologies”. Large-scale data processing can be sped up considerably by distributing it over a cluster of computers. The aim of this project is to parallelize the download of Wikipedia dumps using different tools, and to distribute their extraction across multiple machines using Apache Spark, ensuring both speed and scalability.
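As a rough sketch of what Spark-based distribution of the extraction could look like (not the project's actual implementation), the snippet below scans a decompressed Wikipedia XML dump for page titles in parallel. The file path, the `DumpTitleScan` object name, and the line-based title regex are all assumptions for illustration; the real extraction framework performs much richer parsing of each page.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: distribute a scan over a Wikipedia dump with Spark.
// Assumes a decompressed dump at the hypothetical path below; the
// actual DBpedia extraction is far more involved than this title scan.
object DumpTitleScan {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WikipediaDumpTitleScan")
      .getOrCreate()
    val sc = spark.sparkContext

    // In the dump format, each <title> element sits on its own line,
    // so a per-line regex is enough for a sketch.
    val titleTag = "<title>(.*)</title>".r

    // textFile splits the input into partitions that are processed on
    // different executors, so the scan scales out across the cluster.
    val titles = sc.textFile("/data/enwiki-latest-pages-articles.xml")
      .flatMap(line => titleTag.findFirstMatchIn(line).map(_.group(1)))

    println(s"Pages found: ${titles.count()}")
    spark.stop()
  }
}
```

The same pattern generalizes: once pages are exposed as a distributed collection, each extraction step becomes a transformation that Spark schedules across the cluster, which is where the speed and scalability gains come from.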