DBpedia NIF: Open, Large-Scale and Multilingual Knowledge Extraction Corpus
Milan Dojchinovski, Julio Hernandez, Markus Ackermann, Amit Kirschenbaum, Sebastian Hellmann

TL;DR
This paper introduces DBpedia NIF, a large-scale, multilingual corpus derived from Wikipedia articles, designed to enhance structured data extraction and support NLP and IR research across 128 languages.
Contribution
It presents a new multilingual dataset in NLP Interchange Format, expanding DBpedia's structured information and providing a resource for diverse NLP and IR applications.
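To make the NLP Interchange Format (NIF) concrete, here is a minimal Python sketch of how one sentence and one link could be expressed as NIF-style triples. It is an illustration, not the DBpedia extraction pipeline: the helper names (`annotate`, `to_ntriples`), the `http://example.org/doc1` base URI and the sample sentence are all assumptions made for this example, and the URI scheme used in the published dataset may differ. The vocabulary terms (`nif:Context`, `nif:isString`, `nif:beginIndex`, `nif:endIndex`, `nif:anchorOf`, `nif:referenceContext`, `itsrdf:taIdentRef`) come from the NIF 2.0 core ontology and the ITS RDF vocabulary.

```python
# Namespaces of the NIF 2.0 core ontology and related vocabularies.
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
ITSRDF = "http://www.w3.org/2005/11/its/rdf#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_NNI = "http://www.w3.org/2001/XMLSchema#nonNegativeInteger"


def iri(value):
    """Wrap a URI in N-Triples angle brackets."""
    return f"<{value}>"


def literal(value, datatype=None):
    """Serialize a value as an N-Triples literal with basic escaping."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"^^<{datatype}>' if datatype else f'"{escaped}"'


def annotate(base_uri, text, links):
    """Build NIF-style triples for a text and its links.

    `links` is a list of (anchor_text, target_uri) pairs (hypothetical input
    shape). Offsets are Python string indices; only the first occurrence of
    each anchor is annotated.
    """
    context = iri(f"{base_uri}#char=0,{len(text)}")
    triples = [
        (context, iri(RDF_TYPE), iri(NIF + "Context")),
        (context, iri(NIF + "isString"), literal(text)),
        (context, iri(NIF + "beginIndex"), literal(0, XSD_NNI)),
        (context, iri(NIF + "endIndex"), literal(len(text), XSD_NNI)),
    ]
    for anchor, target in links:
        begin = text.find(anchor)
        if begin < 0:
            continue  # anchor text not present; skip rather than guess
        end = begin + len(anchor)
        span = iri(f"{base_uri}#char={begin},{end}")
        triples += [
            (span, iri(RDF_TYPE), iri(NIF + "Phrase")),
            (span, iri(NIF + "referenceContext"), context),
            (span, iri(NIF + "beginIndex"), literal(begin, XSD_NNI)),
            (span, iri(NIF + "endIndex"), literal(end, XSD_NNI)),
            (span, iri(NIF + "anchorOf"), literal(anchor)),
            (span, iri(ITSRDF + "taIdentRef"), iri(target)),
        ]
    return triples


def to_ntriples(triples):
    """Join (subject, predicate, object) tuples into N-Triples lines."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    sample = annotate(
        "http://example.org/doc1",  # hypothetical document URI
        "Berlin is the capital of Germany.",
        [("Germany", "http://dbpedia.org/resource/Germany")],
    )
    print(to_ntriples(sample))
```

Each link becomes its own resource identified by character offsets into the context string, which is what lets the article text, its structure and its links live in the same RDF graph.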
Findings
Dataset covers all articles in 128 Wikipedia languages.
Enriched with 25% more links, with selected partitions published as Linked Data.
Supports NLP and IR tasks with a large-scale multilingual resource.
Abstract
In the past decade, the DBpedia community has put a significant amount of effort into developing technical infrastructure and methods for efficient extraction of structured information from Wikipedia. These efforts have focused primarily on harvesting, refining and publishing semi-structured information found in Wikipedia articles, such as information from infoboxes, categorization information, images, wikilinks and citations. Nevertheless, a vast amount of valuable information is still contained in the unstructured Wikipedia article texts. In this paper, we present DBpedia NIF - a large-scale and multilingual knowledge extraction corpus. The aim of the dataset is two-fold: to dramatically broaden and deepen the amount of structured information in DBpedia, and to provide a large-scale and multilingual language resource for the development of various NLP and IR tasks. The dataset provides the…
Taxonomy
Topics: Natural Language Processing Techniques · Wikis in Education and Collaboration · Topic Modeling
