Annotation Errors and NER: A Study with OntoNotes 5.0

Gabriel Bernier-Colborne; Sowmya Vajjala

arXiv:2406.19172·cs.CL·June 28, 2024

TL;DR

This study identifies and corrects annotation errors in the OntoNotes 5.0 NER dataset using simple techniques, leading to improved model performance and demonstrating the importance of dataset quality in NLP tasks.

Contribution

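The paper introduces three simple, largely language-agnostic techniques for detecting annotation errors in large NER datasets. Applied to OntoNotes 5.0, the resulting corrections improve data quality and raise overall F-scores by 1.23% on average for models trained with three NER libraries.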

Findings

01. Corrected ~10% of sentences in OntoNotes 5.0
02. Improved NER model F-scores by 1.23% on average
03. Large improvements (>10%) for some entity types

Abstract

Named Entity Recognition (NER) is a well-studied problem in NLP. However, there is much less focus on studying NER datasets, compared to developing new NER models. In this paper, we employed three simple techniques to detect annotation errors in the OntoNotes 5.0 corpus for English NER, which is the largest available NER corpus for English. Our techniques corrected ~10% of the sentences in train/dev/test data. In terms of entity mentions, we corrected the span and/or type of ~8% of mentions in the dataset, while adding/deleting/splitting/merging a few more. These are large numbers of changes, considering the size of OntoNotes. We used three NER libraries to train, evaluate and compare the models trained with the original and the re-annotated datasets, which showed an average improvement of 1.23% in overall F-scores, with large (>10%) improvements for some of the entity types. While our…
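
The abstract reports changes at two levels: the share of sentences corrected and the share of entity mentions whose span and/or type changed. The following is a minimal sketch of how such counts could be computed when comparing an original and a re-annotated version of the same data. It assumes both versions are BIO-tagged with identical tokenization; the function names `bio_to_mentions` and `compare_annotations` and the toy sentences are hypothetical and are not taken from the paper. It does not implement the authors' three error-detection techniques, only the bookkeeping after corrections exist.

```python
from typing import Dict, List, Set, Tuple

# A mention is (start token index, end token index exclusive, entity type).
Mention = Tuple[int, int, str]


def bio_to_mentions(tags: List[str]) -> Set[Mention]:
    """Convert one sentence of BIO tags into a set of entity mentions."""
    mentions: Set[Mention] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any mention still open at sentence end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            mentions.add((start, i, etype))
            start, etype = None, None
        # Lenient decoding: a stray "I-" with no open mention starts a new one.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return mentions


def compare_annotations(
    original: List[List[str]], revised: List[List[str]]
) -> Dict[str, int]:
    """Count changed sentences and changed mentions between two versions."""
    stats = {
        "sentences": 0,
        "sentences_changed": 0,
        "mentions": 0,       # mentions in the original annotation
        "retyped": 0,        # same span, different entity type
        "other_change": 0,   # span changed, or mention deleted/split/merged
    }
    for old_tags, new_tags in zip(original, revised):
        old, new = bio_to_mentions(old_tags), bio_to_mentions(new_tags)
        stats["sentences"] += 1
        stats["sentences_changed"] += int(old != new)
        new_spans = {(s, e) for s, e, _ in new}
        for s, e, t in old:
            stats["mentions"] += 1
            if (s, e, t) in new:
                continue
            if (s, e) in new_spans:
                stats["retyped"] += 1
            else:
                stats["other_change"] += 1
    return stats


if __name__ == "__main__":
    # Toy data for illustration only; not drawn from OntoNotes.
    original = [
        ["B-ORG", "I-ORG", "O"],
        ["B-PERSON", "O", "B-GPE"],
        ["O", "B-DATE"],
    ]
    revised = [
        ["B-ORG", "I-ORG", "O"],
        ["B-PERSON", "O", "B-LOC"],
        ["B-DATE", "I-DATE"],
    ]
    print(compare_annotations(original, revised))
```

Dividing `sentences_changed` by `sentences`, and `retyped + other_change` by `mentions`, gives proportions comparable in kind to the ~10% and ~8% figures above. Mentions that exist only in the revised version are not counted separately in this sketch.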

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies
