CleanCoNLL: A Nearly Noise-Free Named Entity Recognition Dataset
Susanna Rücker, Alan Akbik

TL;DR
CleanCoNLL is a meticulously relabeled, near noise-free NER dataset that enables more accurate evaluation of model performance and error analysis, surpassing previous datasets in quality.
Contribution
We provide a thoroughly relabeled version of CoNLL-03 with enhanced annotation consistency and entity linking, significantly reducing noise and improving evaluation reliability.
Findings
State-of-the-art models achieve 97.1% F1 on CleanCoNLL.
On the original CoNLL-03, 47% of the errors counted against models were in fact correct predictions penalized because of annotation noise.
CleanCoNLL enables more precise analysis of model errors and upper performance bounds.
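To make the findings above concrete, here is a minimal sketch of how entity-level F1 is commonly computed and how noisy gold labels can turn correct predictions into counted errors. The function names (`span_f1`, `false_error_share`) and the toy spans are assumptions made up for illustration. They are not taken from the paper or its released code, and the error share below only looks at predicted spans, which may differ from the paper's exact error accounting.

```python
def span_f1(gold, pred):
    """Entity-level F1 over sets of (sentence_id, start, end, label) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # a span counts only if boundaries and label both match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def false_error_share(noisy_gold, clean_gold, pred):
    """Share of predictions counted as errors under noisy labels
    that match the corrected labels (hypothetical helper)."""
    counted_errors = set(pred) - set(noisy_gold)
    if not counted_errors:
        return 0.0
    actually_correct = counted_errors & set(clean_gold)
    return len(actually_correct) / len(counted_errors)


# Toy data, invented for illustration only (not from CoNLL-03 or CleanCoNLL).
noisy_gold = {(0, 0, 1, "PER"), (0, 3, 4, "LOC"), (1, 0, 2, "ORG"), (1, 5, 6, "LOC")}
clean_gold = {(0, 0, 1, "PER"), (0, 3, 4, "ORG"), (1, 0, 2, "ORG"), (1, 5, 6, "LOC")}
pred = {(0, 0, 1, "PER"), (0, 3, 4, "ORG"), (1, 0, 2, "ORG"), (1, 5, 6, "MISC")}

f1_noisy = span_f1(noisy_gold, pred)
f1_clean = span_f1(clean_gold, pred)
share = false_error_share(noisy_gold, clean_gold, pred)
print(f1_noisy, f1_clean, share)
```

In this toy setup the same predictions score higher against the corrected labels, and one of the two counted errors disappears once the gold label is fixed. This is the kind of effect the findings describe, at a much smaller scale.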
Abstract
The CoNLL-03 corpus is arguably the most well-known and utilized benchmark dataset for named entity recognition (NER). However, prior works found significant numbers of annotation errors, incompleteness, and inconsistencies in the data. This poses challenges to objectively comparing NER approaches and analyzing their errors, as current state-of-the-art models achieve F1-scores that are comparable to or even exceed the estimated noise level in CoNLL-03. To address this issue, we present a comprehensive relabeling effort assisted by automatic consistency checking that corrects 7.0% of all labels in the English CoNLL-03. Our effort adds a layer of entity linking annotation both for better explainability of NER labels and as additional safeguard of annotation quality. Our experimental evaluation finds not only that state-of-the-art approaches reach significantly higher F1-scores (97.1%) on…
Taxonomy
Topics: Topic Modeling · Data Quality and Management · Natural Language Processing Techniques
