Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?
Gaurav Kamath, Sowmya Vajjala

TL;DR
This paper investigates whether synthetic data augmentation improves Named Entity Recognition in low-resource settings, evaluating 11 languages from diverse language families and finding promising results with notable variation between languages.
Contribution
It provides an empirical evaluation of synthetic data's impact on low-resource multilingual NER, highlighting its potential and language-specific differences.
Findings
Synthetic data improves NER performance in low-resource languages
Significant variation in effectiveness across different languages
Synthetic data shows promise as a data augmentation technique
Abstract
Named Entity Recognition (NER) for low-resource languages aims to produce robust systems for languages where there is limited labeled training data available, and has been an area of increasing interest within NLP. Data augmentation for increasing the amount of low-resource labeled data is a common practice. In this paper, we explore the role of synthetic data in the context of multilingual, low-resource NER, considering 11 languages from diverse language families. Our results suggest that synthetic data does in fact hold promise for low-resource language NER, though we see significant variation between languages.
Taxonomy
Topics: Topic Modeling · Web Data Mining and Analysis · Natural Language Processing Techniques
