Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?

Gaurav Kamath; Sowmya Vajjala

arXiv:2505.16814 · cs.CL · February 16, 2026

TL;DR

This paper investigates whether synthetic data augmentation improves Named Entity Recognition (NER) for low-resource languages. Experiments on 11 languages from diverse language families show promising results, with significant variation between languages.

Contribution

It provides an empirical evaluation of synthetic data's impact on low-resource multilingual NER, highlighting its potential and language-specific differences.

Findings

01

Synthetic data improves NER performance in low-resource languages

02

Significant variation in effectiveness across different languages

03

Synthetic data shows promise as a data augmentation technique

Abstract

Named Entity Recognition (NER) for low-resource languages aims to produce robust systems for languages where limited labeled training data is available, and has been an area of increasing interest within NLP. Data augmentation for increasing the amount of low-resource labeled data is a common practice. In this paper, we explore the role of synthetic data in the context of multilingual, low-resource NER, considering 11 languages from diverse language families. Our results suggest that synthetic data does in fact hold promise for low-resource language NER, though we see significant variation between languages.

Taxonomy

Topics: Topic Modeling · Web Data Mining and Analysis · Natural Language Processing Techniques