The Impact of Data Corruption on Named Entity Recognition for Low-resourced Languages
Manuel Fokam, Michael Beukman

TL;DR
This paper systematically examines how data quantity and quality affect named entity recognition (NER) in low-resource languages. It finds that a smaller set of fully labeled sentences yields better performance than a larger set of sentences with missing labels, and that models can perform well with only 10% of the training data.
Contribution
The paper systematically measures how data quality and quantity affect the NER performance of pre-trained language models in low-resource languages, and shows that complete labeling matters more than sheer data volume.
Findings
Training on fewer, fully labeled sentences is significantly better than training on more sentences with missing labels.
Models perform remarkably well with only 10% of the training data.
Results are consistent across multiple languages and models.
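The two kinds of degradation contrasted above can be made concrete with a small sketch. This is an illustration only, not the authors' procedure: the function names (subsample_sentences, drop_entity_labels), the BIO tag format, and the toy sentences are all assumptions introduced here. The first function reduces quantity by keeping a random fraction of sentences with their labels intact; the second reduces quality by keeping every sentence but resetting a random share of entity spans to "O", which simulates missing annotations.

```python
import random
from typing import List, Tuple

# A sentence is a list of (token, BIO tag) pairs. This format is an assumption.
Sentence = List[Tuple[str, str]]


def subsample_sentences(data: List[Sentence], fraction: float, seed: int = 0) -> List[Sentence]:
    """Quantity reduction: keep a random fraction of sentences, labels untouched."""
    rng = random.Random(seed)
    k = max(1, round(len(data) * fraction))
    return rng.sample(data, k)


def entity_spans(sentence: Sentence) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for each entity span."""
    spans = []
    start = None
    for i, (_, tag) in enumerate(sentence):
        if tag.startswith("B-"):
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag.startswith("I-"):
            if start is None:  # tolerate an I- tag without a preceding B-
                start = i
        else:
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(sentence)))
    return spans


def drop_entity_labels(data: List[Sentence], keep_fraction: float, seed: int = 0) -> List[Sentence]:
    """Quality reduction: keep all sentences, but relabel each entity span
    as "O" with probability 1 - keep_fraction (a simulated missing label)."""
    rng = random.Random(seed)
    corrupted = []
    for sentence in data:
        new_sentence = list(sentence)
        for start, end in entity_spans(sentence):
            if rng.random() >= keep_fraction:
                for i in range(start, end):
                    new_sentence[i] = (new_sentence[i][0], "O")
        corrupted.append(new_sentence)
    return corrupted


# Toy data, invented for illustration.
example: List[Sentence] = [
    [("Amina", "B-PER"), ("visited", "O"), ("Dar", "B-LOC"), ("es", "I-LOC"), ("Salaam", "I-LOC")],
    [("Kigali", "B-LOC"), ("is", "O"), ("growing", "O")],
    [("The", "O"), ("African", "B-ORG"), ("Union", "I-ORG"), ("met", "O")],
    [("Nothing", "O"), ("happened", "O")],
]

fewer_clean = subsample_sentences(example, fraction=0.5)
more_noisy = drop_entity_labels(example, keep_fraction=0.5)
```

Training the same model on fewer_clean and on more_noisy, then comparing entity-level F1 on an untouched test set, mirrors the kind of comparison the findings describe. How the paper itself samples sentences or removes labels may differ from this sketch.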
Abstract
Data availability and quality are major challenges in natural language processing for low-resourced languages. In particular, there is significantly less data available than for higher-resourced languages. This data is also often of low quality, rife with errors, invalid text or incorrect annotations. Many prior works focus on dealing with these problems, either by generating synthetic data, or filtering out low-quality parts of datasets. We instead investigate these factors more deeply, by systematically measuring the effect of data quantity and quality on the performance of pre-trained language models in a low-resourced setting. Our results show that having fewer completely-labelled sentences is significantly better than having more sentences with missing labels; and that models can perform remarkably well with only 10% of the training data. Importantly, these results are consistent…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
