YoNER: A New Yorùbá Multi-domain Named Entity Recognition Dataset
Peace Busola Falola, Jesujoba O. Alabi, Solomon O. Akinola, Folashade T. Ogunajo, Emmanuel Oluwadunsin Alabi, David Ifeoluwa Adelani

TL;DR
YoNER introduces a comprehensive multi-domain Yorùbá NER dataset and a specialized language model, enhancing NLP research for Yorùbá across diverse domains with high-quality annotations and benchmarks.
Contribution
The paper presents the first multi-domain Yorùbá NER dataset with manual annotations and introduces OyoBERT, a Yorùbá-specific language model, advancing NLP resources for the language.
Findings
African-centric models outperform multilingual models for Yorùbá.
Cross-domain performance drops significantly in certain domains.
Yorùbá-specific models such as OyoBERT outperform multilingual models in in-domain settings.
Abstract
Named Entity Recognition (NER) is a foundational NLP task, yet research in Yorùbá has been constrained by limited and domain-specific resources. Existing resources, such as MasakhaNER (a manually annotated news-domain corpus) and WikiAnn (automatically created from Wikipedia), are valuable but restricted in domain coverage. To address this gap, we present YoNER, a new multidomain Yorùbá NER dataset that extends entity coverage beyond news and Wikipedia. The dataset comprises about 5,000 sentences and 100,000 tokens collected from five domains including Bible, Blogs, Movies, Radio broadcast and Wikipedia, and annotated with three entity types: Person (PER), Organization (ORG) and Location (LOC), following CoNLL-style guidelines. Annotation was conducted manually by three native Yorùbá speakers, with an inter-annotator agreement of over 0.70, ensuring high quality and…
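To make the "CoNLL-style" annotation with PER, ORG and LOC more concrete, here is a minimal Python sketch of how such data is commonly read and turned into entity spans. It rests on assumptions that the abstract does not confirm: one token and one tag per line, blank lines between sentences, and a BIO tagging scheme. The sample sentences are invented for illustration and are not taken from YoNER, and the function names `read_conll` and `extract_entities` are hypothetical.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs

# Invented toy data, NOT from YoNER: token and BIO tag separated by whitespace.
SAMPLE = """\
Adé B-PER
lọ O
sí O
Ìbàdàn B-LOC

Yunifásítì B-ORG
Ìbàdàn I-ORG
wà O
ní O
Nàìjíríà B-LOC
"""

def read_conll(text: str) -> List[Sentence]:
    """Split CoNLL-style text into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # First column is the token, last column is the tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences

def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (entity text, label) spans from BIO tags.

    An I- tag that does not continue a span of the same label is
    treated like O in this sketch.
    """
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in sentence:
        if tag.startswith("B-"):
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            tokens.append(token)
        else:
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities

sentences = read_conll(SAMPLE)
for sent in sentences:
    print(extract_entities(sent))
```

If the released files use a different column layout or tagging scheme (for example IOB1 or BILOU), only the tag-handling branch in `extract_entities` would need to change.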
