Wiki-Quantities and Wiki-Measurements: Datasets of quantities and their measurement context from Wikipedia
Jan Göpfert, Patrick Kuckertz, Jann M. Weinand, Detlef Stolten

TL;DR
This paper introduces two large datasets from Wikipedia for identifying quantities and their measurement context, which can help improve data extraction in scientific and engineering fields.
Contribution
The paper presents two novel datasets, Wiki-Quantities and Wiki-Measurements, for identifying and contextualizing quantities in text.
Findings
Wiki-Quantities contains over 1.2 million annotated quantities from English Wikipedia.
Wiki-Measurements includes 38,738 annotated quantities with their measured entities and properties.
Manual validation of 100 samples from each dataset found 100% correct for Wiki-Quantities and 84-94% correct for Wiki-Measurements.
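To make the Wiki-Measurements annotation concrete, the following is a minimal sketch of how one annotated sentence could be represented in Python: a quantity span plus its measured entity, its property, and optional qualifiers. The class names, field names, the `find_span` helper, and the example sentence are all hypothetical and chosen for illustration; they are not the datasets' actual schema or contents.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Character offsets of an annotation within a sentence (end exclusive)."""
    start: int
    end: int

    def text(self, sentence: str) -> str:
        return sentence[self.start:self.end]


@dataclass
class MeasurementAnnotation:
    """Hypothetical record layout, not the datasets' actual schema."""
    sentence: str
    quantity: Span
    measured_entity: Span
    measured_property: Span
    qualifiers: list[Span] = field(default_factory=list)  # optional


def find_span(sentence: str, phrase: str) -> Span:
    """Locate the first occurrence of phrase; raises ValueError if absent."""
    start = sentence.index(phrase)
    return Span(start, start + len(phrase))


# Invented example sentence, not taken from either dataset.
sentence = "The bridge has a length of 120 metres."
example = MeasurementAnnotation(
    sentence=sentence,
    quantity=find_span(sentence, "120 metres"),
    measured_entity=find_span(sentence, "bridge"),
    measured_property=find_span(sentence, "length"),
)

print(example.quantity.text(sentence))
```

In this sketch, a Wiki-Quantities-style record would need only the `sentence` and `quantity` fields, since that dataset annotates quantities without their measurement context.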
Abstract
To cope with the large number of publications, more and more researchers are automatically extracting data of interest using natural language processing methods based on supervised learning. Much data, especially in the natural and engineering sciences, is quantitative, but there is a lack of datasets for identifying quantities and their context in text. To address this issue, we present two large datasets based on Wikipedia and Wikidata: Wiki-Quantities is a dataset consisting of over 1.2 million annotated quantities in the English-language Wikipedia. Wiki-Measurements is a dataset of 38 738 annotated quantities in the English-language Wikipedia along with their respective measured entity, property, and optional qualifiers. Manual validation of 100 samples each of Wiki-Quantities and Wiki-Measurements found 100% and 84-94% correct, respectively. The datasets can be used in pipeline…
Taxonomy
Topics: Wikis in Education and Collaboration · Natural Language Processing Techniques · Topic Modeling
