NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages
Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Dea Adhista, Emmanuel, Dave, Sarah Oktavianti, Salsabil Maulana Akbar, Jhonson Lee, Nuur Shadieq,, Tjeng Wawan Cenggoro, Hanung Wahyuning Linuwih, Bryan Wilie, Galih Pradipta, Muridan, Genta Indra Winata, David Moeljadi

TL;DR
This paper introduces NusaWrites, a high-quality dataset for underrepresented Indonesian languages, demonstrating that native speaker paragraph writing improves lexical diversity and cultural relevance, and highlights the need for multilingual models to support these languages.
Contribution
It presents a new dataset, NusaWrites, and compares data collection methods, showing native speaker writing yields superior quality for low-resource languages.
Findings
Native speaker paragraph writing enhances lexical diversity.
The dataset captures cultural relevance effectively.
Existing multilingual models need extension for underrepresented languages.
Abstract
Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we have identified limitations in the resulting corpora, including a lack of lexical diversity and cultural relevance to local communities. To address this gap, we conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content. In addition, we…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling
