OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages

Chester Palen-Michel, Maxwell Pickering, Maya Kruse, Jonne Sälevä, and Constantine Lignos

arXiv:2412.09587 · cs.CL · December 19, 2025

TL;DR

OpenNER 1.0 is a standardized collection of 36 openly available NER corpora spanning 52 languages, released with baseline model results to support multilingual NER research and benchmarking.

Contribution

The paper introduces a multilingual NER dataset collection with corrected annotation formats and consistent entity type names across corpora, facilitating future research and model benchmarking.

Findings

01. No single model outperforms others across all languages.

02. Large language models still have significant room for improvement in NER.

03. Standardization enables fairer and more effective multilingual NER research.

Abstract

We present OpenNER 1.0, a standardized collection of openly available named entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52 languages, human-annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation with consistent entity type names across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline results using three pretrained multilingual language models and two large language models to compare the performance of recent models and facilitate future research in NER. We find that no single model is best in all languages and that significant work remains to obtain high performance from LLMs on the NER task. OpenNER is released at https://github.com/bltlab/open-ner.
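
The abstract mentions standardizing entity type names across corpora. As a rough illustration of what that step can involve, the sketch below rewrites the entity type in BIO labels using a lookup table. It assumes BIO-style labels such as `B-PERSON`; the `TYPE_MAP` entries and the function names are hypothetical and are not taken from the OpenNER release, whose actual mapping and file format may differ.

```python
from typing import Dict, List

# Hypothetical mapping from corpus-specific type names to one shared name.
# These entries are illustrative assumptions, not OpenNER's actual mapping.
TYPE_MAP: Dict[str, str] = {
    "PERSON": "PER",
    "PER": "PER",
    "LOCATION": "LOC",
    "LOC": "LOC",
    "ORGANIZATION": "ORG",
    "ORG": "ORG",
}


def normalize_label(label: str, type_map: Dict[str, str]) -> str:
    """Rewrite the entity type of one BIO label, keeping its B-/I- prefix."""
    if label == "O":
        return label
    prefix, sep, entity_type = label.partition("-")
    if not sep or prefix not in ("B", "I"):
        raise ValueError(f"not a BIO label: {label!r}")
    # Lookup is case-insensitive; unknown types pass through unchanged
    # so that no annotation is silently dropped.
    return f"{prefix}-{type_map.get(entity_type.upper(), entity_type)}"


def normalize_sentence(
    labels: List[str], type_map: Dict[str, str] = TYPE_MAP
) -> List[str]:
    """Apply normalize_label to every label of one sentence."""
    return [normalize_label(label, type_map) for label in labels]


if __name__ == "__main__":
    print(normalize_sentence(["B-PERSON", "I-PERSON", "O", "B-Location"]))
```

Passing unknown types through unchanged is a deliberate choice in this sketch: corpora with different ontologies keep their extra types rather than being forced into a smaller label set.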

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management