Do CoNLL-2003 Named Entity Taggers Still Work Well in 2023?
Shuheng Liu, Alan Ritter

TL;DR
This study evaluates how well NER models trained on the 20-year-old CoNLL-2003 dataset perform on modern data. It finds that recent pre-trained Transformers maintain strong performance and that evaluating only on the original test set may underestimate progress.
Contribution
The paper provides a comprehensive analysis of NER model generalization over time, highlighting factors influencing performance and challenging assumptions about dataset relevance.
Findings
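Pre-trained Transformers such as RoBERTa and T5 show no evidence of performance degradation on modern data, even when fine-tuned on the decades-old CoNLL-2003 training set.
Model architecture, number of parameters, and the time period of the pre-training data are key factors in generalization.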
NER models have improved more on modern data than on the original test set.
Abstract
The CoNLL-2003 English named entity recognition (NER) dataset has been widely used to train and evaluate NER models for almost 20 years. However, it is unclear how well models that are trained on this 20-year-old data and developed over a period of decades using the same test set will perform when applied on modern data. In this paper, we evaluate the generalization of over 20 different models trained on CoNLL-2003, and show that NER models have very different generalization. Surprisingly, we find no evidence of performance degradation in pre-trained Transformers, such as RoBERTa and T5, even when fine-tuned using decades-old data. We investigate why some models generalize well to new data while others do not, and attempt to disentangle the effects of temporal drift and overfitting due to test reuse. Our analysis suggests that most deterioration is due to temporal mismatch between the…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies
Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · Weight Decay · Adam · WordPiece · BERT · RoBERTa · Test
