YoNER: A New Yorùbá Multi-domain Named Entity Recognition Dataset

Peace Busola Falola; Jesujoba O. Alabi; Solomon O. Akinola; Folashade T. Ogunajo; Emmanuel Oluwadunsin Alabi; David Ifeoluwa Adelani

arXiv:2604.05624·cs.CL·April 8, 2026

TL;DR

YoNER introduces a multi-domain Yorùbá NER dataset of about 5,000 sentences and 100,000 tokens drawn from five domains, together with OyoBERT, a Yorùbá-specific language model, and benchmarks for NER across those domains.

Contribution

The paper presents the first multi-domain Yorùbá NER dataset with manual annotations and introduces OyoBERT, a Yorùbá-specific language model, advancing NLP resources for the language.

Findings

01 African-centric models outperform multilingual models for Yorùbá.

02 Cross-domain performance drops significantly in certain domains.

03 Yorùbá-specific models like OyoBERT outperform multilingual models in in-domain tasks.

Abstract

Named Entity Recognition (NER) is a foundational NLP task, yet research in Yorùbá has been constrained by limited and domain-specific resources. Existing resources, such as MasakhaNER (a manually annotated news-domain corpus) and WikiAnn (automatically created from Wikipedia), are valuable but restricted in domain coverage. To address this gap, we present YoNER, a new multi-domain Yorùbá NER dataset that extends entity coverage beyond news and Wikipedia. The dataset comprises about 5,000 sentences and 100,000 tokens collected from five domains (Bible, Blogs, Movies, Radio broadcast, and Wikipedia) and is annotated with three entity types, Person (PER), Organization (ORG), and Location (LOC), following CoNLL-style guidelines. Annotation was conducted manually by three native Yorùbá speakers, with an inter-annotator agreement of over 0.70, ensuring high quality and…
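
To make the annotation format concrete, the following is a minimal sketch of how CoNLL-style NER data with the three entity types above might be read and turned into entity spans. It assumes a two-column layout (token, then tag) with blank lines between sentences and a BIO tagging scheme; the abstract only says "CoNLL-style guidelines", so the exact file layout is an assumption. The function names `read_conll` and `extract_entities` are hypothetical, and the sample sentence is invented for illustration and is not taken from YoNER.

```python
from typing import List, Optional, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(text: str) -> List[Sentence]:
    """Parse 'token tag' lines; a blank line ends a sentence (assumed layout)."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # The tag is the last whitespace-separated field on the line.
        token, tag = line.rsplit(maxsplit=1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (entity_type, surface_text) spans from BIO tags."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    etype: Optional[str] = None
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype
        )
        if starts_new:
            # Close any open span, then open a new one.
            if tokens:
                entities.append((etype, " ".join(tokens)))
            tokens, etype = [token], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(token)  # continue the open span
        else:
            # An 'O' tag closes any open span.
            if tokens:
                entities.append((etype, " ".join(tokens)))
            tokens, etype = [], None
    if tokens:
        entities.append((etype, " ".join(tokens)))
    return entities


# Invented example sentence (not from the dataset), tagged with PER and LOC.
SAMPLE = """Wọlé B-PER
Ṣóyínká I-PER
lọ O
sí O
Ìbàdàn B-LOC
"""

if __name__ == "__main__":
    for sent in read_conll(SAMPLE):
        print(extract_entities(sent))
```

Grouping tokens into typed spans like this is the usual first step before computing entity-level precision, recall, and F1, which is how CoNLL-style NER is typically scored.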
