WikiAtomicEdits: A Multilingual Corpus of Wikipedia Edits for Modeling Language and Discourse

Manaal Faruqui; Ellie Pavlick; Ian Tenney; Dipanjan Das

arXiv:1808.09422·cs.CL·August 29, 2018·1 cites

Open Access

TL;DR

This paper introduces a large multilingual corpus of atomic Wikipedia edits and shows that the language produced during editing differs from that of standard corpora, and that models trained on edits encode different aspects of semantics and discourse than models trained on raw text.

Contribution

The authors release a corpus of 43 million atomic Wikipedia edits (single-phrase insertions and deletions) across 8 languages, as a resource for research in semantics, discourse, and representation learning.

Findings

01. Editing language differs from standard corpora

02. Models trained on edits encode different semantic aspects

03. The corpus supports research in semantics and discourse

Abstract

We release a corpus of 43 million atomic edits across 8 languages. These edits are mined from Wikipedia edit history and consist of instances in which a human editor has inserted a single contiguous phrase into, or deleted a single contiguous phrase from, an existing sentence. We use the collected data to show that the language generated during editing differs from the language that we observe in standard corpora, and that models trained on edits encode different aspects of semantics and discourse than models trained on raw, unstructured text. We release the full corpus as a resource to aid ongoing research in semantics, discourse, and representation learning.
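To make the definition of an atomic edit concrete, here is a minimal sketch of how one might test whether a pair of sentence versions differs by exactly one contiguous inserted or deleted phrase. It is an illustration only, not the authors' extraction pipeline: the function name `find_atomic_edit`, the whitespace tokenization, and the example sentences are all assumptions made here.

```python
from typing import List, Optional, Tuple


def _common_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading tokens that two token lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def find_atomic_edit(before: str, after: str) -> Optional[Tuple[str, str]]:
    """Return (kind, phrase) if `after` is `before` with a single contiguous
    phrase inserted or deleted; otherwise return None.

    Assumption: tokens are whitespace-separated. A real pipeline would
    likely use a language-specific tokenizer.
    """
    src, tgt = before.split(), after.split()
    if len(src) == len(tgt):
        return None  # no length change, so not a pure insertion or deletion

    kind = "insertion" if len(tgt) > len(src) else "deletion"
    shorter, longer = (src, tgt) if kind == "insertion" else (tgt, src)

    k = len(longer) - len(shorter)           # length of the candidate phrase
    p = _common_prefix_len(shorter, longer)  # where the two versions diverge

    # The edit is atomic only if skipping k tokens in the longer sentence
    # makes the remainder line up exactly with the shorter sentence.
    if longer[p + k:] != shorter[p:]:
        return None
    return kind, " ".join(longer[p:p + k])


print(find_atomic_edit("She moved to Paris .", "She moved to Paris in 1999 ."))
# → ('insertion', 'in 1999')
print(find_atomic_edit("a b c", "a x b y c"))
# → None (two separate insertions, so not atomic)
```

When the inserted phrase repeats neighbouring tokens, more than one insertion position is consistent with the pair; this sketch reports the rightmost one implied by the shared prefix.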

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Wikis in Education and Collaboration