NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages
Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Romadhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, Timothy Baldwin, Jey Han Lau, Rico Sennrich, Sebastian Ruder

TL;DR
This paper introduces NusaX, a multilingual parallel resource for 10 low-resource Indonesian languages, comprising datasets, a multi-task benchmark, and lexicons, to advance NLP research for low-resource and endangered languages.
Contribution
The creation of the first parallel resource for 10 Indonesian languages, including datasets, benchmarks, and lexicons, addressing data scarcity in low-resource languages.
Findings
Developed the first parallel dataset for 10 Indonesian languages.
Provided extensive analysis of challenges in resource creation.
Laid groundwork for future NLP research on underrepresented languages.
Abstract
Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing resources for languages in Indonesia. Despite being the second most linguistically diverse country, most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes datasets, a multi-task benchmark, and lexicons, as well as a parallel Indonesian-English dataset. We provide extensive analyses and describe the challenges when creating such resources. We hope that our…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Sentiment Analysis and Opinion Mining
