NewsEdits: A Dataset of Revision Histories for News Articles (Technical Report: Data Processing)
Alexander Spangher, Jonathan May

TL;DR
NewsEdits is the first large-scale, multilingual dataset of news article revision histories, enabling new research in linguistics and social sciences with over 1.2 million articles and 72 million atomic edits.
Contribution
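The paper introduces NewsEdits, which the authors describe as the first publicly available dataset of news article revision histories and, to their knowledge, the largest corpus of revision histories in any domain. It covers English- and French-language newspaper sources and records sentence-level additions, changes, and removals between versions.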
Findings
Contains 1,278,804 articles with 4,609,430 versions.
Includes 72 million atomic edits derived from sentence changes.
Enables novel research in linguistics and social sciences.
Abstract
News article revision histories have the potential to give us novel insights across varied fields of linguistics and social sciences. In this work, we present, to our knowledge, the first publicly available dataset of news article revision histories, or NewsEdits. Our dataset is multilingual; it contains 1,278,804 articles with 4,609,430 versions from over 22 English- and French-language newspaper sources based in three countries. Across version pairs, we count 10.9 million added sentences, 8.9 million changed sentences, and 6.8 million removed sentences. Within the changed sentences, we derive 72 million atomic edits. NewsEdits is, to our knowledge, the largest corpus of revision histories of any domain.
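To make the counting in the abstract concrete, here is a minimal sketch of how added, changed, and removed sentences between two versions, and word-level edits within a changed sentence, could be tallied. It is an illustration only and not the authors' pipeline: the function names `sentence_diff` and `word_edits` are hypothetical, the alignment uses Python's standard `difflib.SequenceMatcher` as a stand-in, and each version is assumed to be already split into a list of sentences.

```python
from difflib import SequenceMatcher


def sentence_diff(old, new):
    """Count sentence-level operations between two versions.

    `old` and `new` are lists of sentences (assumed pre-split).
    This is a stand-in alignment, not the method used in the paper.
    """
    counts = {"added": 0, "removed": 0, "changed": 0, "unchanged": 0}
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        n_old, n_new = i2 - i1, j2 - j1
        if tag == "equal":
            counts["unchanged"] += n_old
        elif tag == "delete":
            counts["removed"] += n_old
        elif tag == "insert":
            counts["added"] += n_new
        else:  # "replace": pair sentences up, the surplus is added or removed
            paired = min(n_old, n_new)
            counts["changed"] += paired
            counts["removed"] += n_old - paired
            counts["added"] += n_new - paired
    return counts


def word_edits(old_sentence, new_sentence):
    """List word-level edits inside one changed sentence (whitespace tokens)."""
    a, b = old_sentence.split(), new_sentence.split()
    ops = SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes()
    return [(tag, a[i1:i2], b[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


old_version = [
    "The mayor spoke on Monday.",
    "Two people were injured.",
    "Police are investigating.",
]
new_version = [
    "The mayor spoke on Monday.",
    "Three people were injured.",
    "Police are investigating.",
    "A suspect was arrested.",
]

counts = sentence_diff(old_version, new_version)
edits = word_edits(old_version[1], new_version[1])
print(counts)
print(edits)
```

In this toy pair, one sentence is changed and one is added, and the changed sentence yields a single word-level replacement. Exact-match alignment is a deliberate simplification; a real revision corpus would need fuzzier sentence matching to handle moved or heavily rewritten sentences.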
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Web Data Mining and Analysis
