OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages
Chester Palen-Michel, Maxwell Pickering, Maya Kruse, Jonne Sälevä, and Constantine Lignos

TL;DR
OpenNER 1.0 offers a comprehensive, standardized collection of NER datasets across 52 languages, enabling improved multilingual NER research and benchmarking with baseline model results.
Contribution
The paper introduces a unified, corrected, and standardized multilingual NER dataset collection with consistent annotations, facilitating future research and model benchmarking.
Findings
No single model outperforms others across all languages.
Large language models still have significant room for improvement in NER.
Standardization enables fairer and more effective multilingual NER research.
Abstract
We present OpenNER 1.0, a standardized collection of openly-available named entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52 languages, human-annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation with consistent entity type names across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline results using three pretrained multilingual language models and two large language models to compare the performance of recent models and facilitate future research in NER. We find that no single model is best in all languages and that significant work remains to obtain high performance from LLMs on the NER task. OpenNER is released at https://github.com/bltlab/open-ner.
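The standardization step described in the abstract (fixing annotation format issues and unifying entity type names across corpora) can be illustrated with a minimal sketch. This is not the OpenNER implementation: the function name `standardize_tags`, the `TYPE_MAP` table, and the choice of BIO tags with `PER`/`LOC`/`ORG` as the shared label set are all assumptions made for illustration.

```python
# Hypothetical mapping from corpus-specific entity type names to a shared
# label set. The real OpenNER mappings are not reproduced here.
TYPE_MAP = {
    "PERSON": "PER", "PER": "PER",
    "LOCATION": "LOC", "LOC": "LOC",
    "ORGANIZATION": "ORG", "ORG": "ORG",
}


def standardize_tags(tags, type_map=TYPE_MAP):
    """Normalize entity type names and repair invalid BIO sequences.

    Assumes each tag is either "O" or "<B|I>-<TYPE>".
    """
    fixed = []
    prev_type = None  # entity type of the previous token, None if outside
    for tag in tags:
        if tag == "O":
            fixed.append("O")
            prev_type = None
            continue
        prefix, _, raw_type = tag.partition("-")
        # Unknown types are kept, only upper-cased.
        ent_type = type_map.get(raw_type.upper(), raw_type.upper())
        # An I- tag that does not continue an entity of the same type is
        # invalid BIO, so promote it to B-.
        if prefix == "I" and prev_type != ent_type:
            prefix = "B"
        fixed.append(f"{prefix}-{ent_type}")
        prev_type = ent_type
    return fixed


if __name__ == "__main__":
    print(standardize_tags(["I-PERSON", "I-PERSON", "O", "B-location", "I-ORG"]))
```

Keeping the type mapping in a plain table separate from the repair logic means each corpus could supply its own table without changing the sequence-repair code.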
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management
