TL;DR
This paper introduces LongtoNotes, a coreference resolution corpus built by merging the split parts of Ontonotes documents back into full-length documents, so that models can be evaluated on long documents across multiple genres.
Contribution
The work provides a manually curated corpus of longer documents for coreference resolution. It addresses the splitting of long documents in Ontonotes and supports research on long-document modeling.
Findings
State-of-the-art models show performance drops on longer documents.
Model architecture and hyperparameters significantly affect performance and efficiency.
The new corpus reveals specific challenges in long-document coreference resolution.
Abstract
Ontonotes has served as the most important benchmark for coreference resolution. However, for ease of annotation, several long documents in Ontonotes were split into smaller parts. In this work, we build a corpus of coreference-annotated documents of significantly longer length than what is currently available. We do so by providing an accurate, manually curated merging of annotations from documents that were split into multiple parts in the original Ontonotes annotation process. The resulting corpus, which we call LongtoNotes, contains documents in multiple genres of the English language with varying lengths, the longest of which are up to 8x the length of documents in Ontonotes, and 2x those in Litbank. We evaluate state-of-the-art neural coreference systems on this new corpus, analyze the relationships between model architectures/hyperparameters and document length on performance and…
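The merging step described in the abstract can be illustrated with a small sketch. It is not the authors' tooling: the function name merge_parts, the (start, end) span representation, and the links table are all hypothetical. The links table stands in for the manual decisions about which part-local clusters refer to the same entity; the code only concatenates the parts and shifts mention offsets.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]            # (start_token, end_token), inclusive, part-local
Clusters = Dict[int, List[Span]]  # part-local cluster id -> mention spans


def merge_parts(
    parts: List[Tuple[List[str], Clusters]],
    links: Dict[Tuple[int, int], int],
) -> Tuple[List[str], Dict[int, List[Span]]]:
    """Concatenate split parts and re-index their coreference clusters.

    `links` (hypothetical) maps (part_index, local_cluster_id) to a
    document-level entity id and represents the human merge decisions.
    Clusters absent from `links` receive a fresh document-level id.
    """
    tokens: List[str] = []
    merged: Dict[int, List[Span]] = {}
    next_id = max(links.values(), default=-1) + 1
    for part_idx, (part_tokens, clusters) in enumerate(parts):
        offset = len(tokens)  # tokens seen before this part
        for local_id, spans in sorted(clusters.items()):
            key = (part_idx, local_id)
            if key in links:
                global_id = links[key]
            else:
                global_id = next_id
                next_id += 1
            # Shift part-local spans into document-level token positions.
            merged.setdefault(global_id, []).extend(
                (start + offset, end + offset) for start, end in spans
            )
        tokens.extend(part_tokens)
    return tokens, merged


# Toy example: "Alice" is annotated separately in each part.
part0 = (["Alice", "said", "she", "left", "."], {0: [(0, 0), (2, 2)]})
part1 = (["Later", "Alice", "met", "Bob", "."], {0: [(1, 1)], 1: [(3, 3)]})
links = {(0, 0): 0, (1, 0): 0}  # annotator judged both clusters to be one entity
doc_tokens, doc_clusters = merge_parts([part0, part1], links)
```

In this toy case the two "Alice" clusters become a single document-level chain spanning both parts, while "Bob", which has no cross-part link, keeps its own chain.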
