Wikipedia Citations: Reproducible Citation Extraction from Multilingual Wikipedia
Natallia Kokash, Giovanni Colavizza

TL;DR
This paper presents a reproducible, open-source pipeline for extracting, processing, and translating citations from Wikipedia in multiple languages, facilitating integration with open science initiatives.
Contribution
It introduces a scalable, cloud-based pipeline capable of extracting millions of citations from Wikipedia across multiple languages, with a focus on reproducibility and open data.
Findings
Extracted 29.3 million citations from English Wikipedia in 2020
Extracted 40.6 million citations in February 2023 and 44.7 million in February 2024
Supported 15 European languages with citation template translation
Abstract
Wikipedia is an essential component of the open science ecosystem, yet it is poorly integrated with academic open science initiatives. Wikipedia Citations is a project that focuses on extracting and releasing comprehensive datasets of citations from Wikipedia. A total of 29.3 million citations were extracted from English Wikipedia in May 2020. Following this one-off research project, we designed a reproducible pipeline that can process any given Wikipedia dump in a cloud-based setting. To demonstrate its usability, we extracted 40.6 million citations in February 2023 and 44.7 million citations in February 2024. Furthermore, we equipped the pipeline with an adapted Wikipedia citation template translation module to process multilingual Wikipedia articles in 15 European languages so that they are parsed and mapped into a generic structured citation template. This paper presents our…
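The idea of mapping a localized citation template onto a generic one can be illustrated with a minimal Python sketch. It is not the pipeline's actual translation module: the mapping tables, the Italian template and parameter names, and the helper names `parse_template` and `translate_template` are all assumptions chosen for illustration, and the parser only handles flat templates without nested templates or wikilinks.

```python
# Illustrative mapping tables only. These names are assumptions for this sketch,
# not the translation tables used by the Wikipedia Citations pipeline.
TEMPLATE_NAMES = {"cita web": "cite web", "cita libro": "cite book"}
PARAM_NAMES = {
    "titolo": "title",
    "autore": "author",
    "data": "date",
    "editore": "publisher",
}


def parse_template(wikitext):
    """Split a flat {{name|key=value|...}} template into (name, named params)."""
    inner = wikitext.strip()
    if not (inner.startswith("{{") and inner.endswith("}}")):
        raise ValueError("not a template: %r" % wikitext)
    parts = inner[2:-2].split("|")
    name = parts[0].strip().lower()
    params = {}
    for part in parts[1:]:
        # Split on the first "=" only, so values such as URLs may contain "=".
        key, sep, value = part.partition("=")
        if sep:  # positional (unnamed) parameters are skipped in this sketch
            params[key.strip().lower()] = value.strip()
    return name, params


def translate_template(wikitext):
    """Map a localized template onto generic (English-style) names.

    Names missing from the mapping tables are passed through unchanged.
    """
    name, params = parse_template(wikitext)
    return {
        "template": TEMPLATE_NAMES.get(name, name),
        "params": {PARAM_NAMES.get(k, k): v for k, v in params.items()},
    }


if __name__ == "__main__":
    example = "{{Cita web|titolo=Esempio|autore=Rossi|url=http://example.org/?a=b}}"
    print(translate_template(example))
```

A production parser would need to handle nested templates, pipes inside wikilinks, and positional parameters, which this sketch deliberately ignores.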
Taxonomy
Topics: Wikis in Education and Collaboration · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies
