NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages

Samuel Cahyawijaya; Holy Lovenia; Fajri Koto; Dea Adhista; Emmanuel Dave; Sarah Oktavianti; Salsabil Maulana Akbar; Jhonson Lee; Nuur Shadieq; Tjeng Wawan Cenggoro; Hanung Wahyuning Linuwih; Bryan Wilie; Galih Pradipta Muridan; Genta Indra Winata; David Moeljadi; Alham Fikri Aji; Ayu Purwarianti; Pascale Fung

arXiv:2309.10661 · cs.CL · September 21, 2023

TL;DR

This paper introduces NusaWrites, a high-quality dataset for underrepresented Indonesian languages, demonstrating that native speaker paragraph writing improves lexical diversity and cultural relevance, and highlights the need for multilingual models to support these languages.

Contribution

It presents a new dataset, NusaWrites, and compares data collection methods, showing native speaker writing yields superior quality for low-resource languages.

Findings

1. Native speaker paragraph writing enhances lexical diversity.
2. The dataset captures cultural relevance effectively.
3. Existing multilingual models need extension for underrepresented languages.

Abstract

Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we have identified limitations in the resulting corpora, including a lack of lexical diversity and cultural relevance to local communities. To address this gap, we conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content. In addition, we…

Code & Models

Repositories

indonlp/nusa-writes (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling