WEC: Deriving a Large-scale Cross-document Event Coreference dataset from Wikipedia
Alon Eirew, Arie Cattan, Ido Dagan

TL;DR
This paper introduces WEC, a large-scale cross-document event coreference dataset derived from Wikipedia, along with a baseline model that outperforms previously published results on this task.
Contribution
The paper presents a novel, scalable methodology for creating large cross-document coreference datasets from Wikipedia, and provides a baseline model adapted from within-document coreference techniques.
Findings
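The WEC-Eng dataset is significantly larger than existing cross-document event coreference corpora, and its coreference links are not restricted to predefined topics.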
The baseline model outperforms previous state-of-the-art results.
The dataset creation method is adaptable to multiple languages.
Abstract
Cross-document event coreference resolution is a foundational task for NLP applications involving multi-text processing. However, existing corpora for this task are scarce and relatively small, while annotating only modest-size clusters of documents belonging to the same topic. To complement these resources and enhance future research, we present Wikipedia Event Coreference (WEC), an efficient methodology for gathering a large-scale dataset for cross-document event coreference from Wikipedia, where coreference links are not restricted within predefined topics. We apply this methodology to the English Wikipedia and extract our large-scale WEC-Eng dataset. Notably, our dataset creation method is generic and can be applied with relatively little effort to other Wikipedia languages. To set baseline results, we develop an algorithm that adapts components of state-of-the-art models for…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
