Piecing Together Cross-Document Coreference Resolution Datasets: Systematic Dataset Analysis and Unification
Anastasia Zhukova, Terry Ruas, Jan Philip Wahle, Bela Gipp

TL;DR
This paper introduces uCDCR, a unified, standardized dataset for cross-document coreference resolution, enabling more consistent, fair, and comprehensive analysis and model training across diverse datasets.
Contribution
The paper presents uCDCR, a consolidated dataset with standardized formats, annotations, and metrics, facilitating reproducible research and cross-dataset analysis in CDCR.
Findings
ECB+ has low lexical diversity compared with the other datasets.
Using all uCDCR datasets improves model generalizability.
Resolving entity and event coreference remains a complex task.
Abstract
Research in cross-document coreference resolution (CDCR) remains fragmented due to heterogeneous dataset formats, varying annotation standards, and the predominant framing of CDCR as event coreference resolution (ECR). To address these challenges, we introduce uCDCR, a unified dataset that consolidates diverse publicly available English CDCR corpora across various domains into a consistent format, which we analyze with standardized metrics and evaluation protocols. uCDCR incorporates both entity and event coreference, corrects known inconsistencies, and enriches datasets with missing attributes to facilitate reproducible research. We establish a cohesive framework for fair, interpretable, and cross-dataset analysis in CDCR and compare the datasets on their lexical properties, e.g., lexical composition of the annotated mentions, lexical diversity and ambiguity metrics, discuss the annotation rules and principles…
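Illustrative sketch
To give a rough sense of what lexical diversity and ambiguity measures over annotated mentions can look like, the Python sketch below computes two simple proxies on toy data. The record layout (chain ID plus mention text), the function names, and both measures are assumptions made for illustration; they are not the uCDCR format or the metric definitions used in the paper.

```python
from collections import defaultdict

# Hypothetical unified mention records: (chain_id, mention_text).
# This layout is illustrative only and is not taken from uCDCR.
mentions = [
    ("chain_1", "earthquake"),
    ("chain_1", "the quake"),
    ("chain_1", "earthquake"),
    ("chain_2", "the quake"),
    ("chain_2", "tremor"),
]


def unique_phrases_per_chain(records):
    """Average number of distinct lowercased phrases per chain (diversity proxy)."""
    phrases = defaultdict(set)
    for chain_id, text in records:
        phrases[chain_id].add(text.lower())
    if not phrases:
        return 0.0
    return sum(len(p) for p in phrases.values()) / len(phrases)


def chains_per_phrase(records):
    """Average number of distinct chains each phrase refers to (ambiguity proxy)."""
    chains = defaultdict(set)
    for chain_id, text in records:
        chains[text.lower()].add(chain_id)
    if not chains:
        return 0.0
    return sum(len(c) for c in chains.values()) / len(chains)


diversity = unique_phrases_per_chain(mentions)
ambiguity = chains_per_phrase(mentions)
print(f"unique phrases per chain: {diversity:.2f}")
print(f"chains per phrase: {ambiguity:.2f}")
```

Under these assumed definitions, a higher first value means chains are expressed with more varied wording, and a higher second value means the same wording points to more distinct chains.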
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies
