IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages

Muhammad Farid Adilazuarda; Samuel Cahyawijaya; Genta Indra Winata; Pascale Fung; Ayu Purwarianti

arXiv:2311.12405·cs.CL·November 22, 2023·1 cites

Open Access

TL;DR

This paper introduces IndoRobusta, a framework for evaluating and improving the robustness of Indonesian NLP models to code-mixing with English and with local languages (Sundanese, Javanese, and Malay), an area that has seen limited exploration.

Contribution

The paper presents IndoRobusta, a novel framework for assessing and improving model robustness to code-mixed Indonesian with multiple embedded languages.

Findings

01. Pre-training corpus bias affects how well models handle Indonesian-English code-mixing.

02. Models handle code-mixing with local languages less effectively than Indonesian-English code-mixing, despite higher language diversity.

03. IndoRobusta provides insights into robustness challenges in multilingual Indonesian NLP.

Abstract

Significant progress has been made on Indonesian NLP. Nevertheless, exploration of the code-mixing phenomenon in Indonesian is limited, despite many languages being frequently mixed with Indonesian in daily conversation. In this work, we explore code-mixing in Indonesian with four embedded languages, i.e., English, Sundanese, Javanese, and Malay; and introduce IndoRobusta, a framework to evaluate and improve the code-mixing robustness. Our analysis shows that the pre-training corpus bias affects the model's ability to better handle Indonesian-English code-mixing when compared to other local languages, despite having higher language diversity.
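As a rough illustration of what evaluating code-mixing robustness can involve, the sketch below swaps a fraction of Indonesian words for embedded-language equivalents and measures the resulting accuracy drop. It is a minimal sketch under stated assumptions, not the IndoRobusta implementation: the toy lexicon, the random word selection, and the names `code_mix`, `accuracy`, and `robustness_gap` are all hypothetical and do not come from the paper.

```python
import random

# Hypothetical toy Indonesian -> English lexicon, for illustration only.
TOY_LEXICON = {"saya": "I", "makan": "eat", "nasi": "rice", "bagus": "good"}


def code_mix(sentence, lexicon, ratio, seed=0):
    """Replace up to `ratio` of the tokens with embedded-language words.

    Assumption: tokens are whitespace-separated and only tokens found in
    `lexicon` can be swapped; which ones are swapped is chosen at random.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in lexicon]
    k = min(len(candidates), round(ratio * len(tokens)))
    for i in rng.sample(candidates, k):
        tokens[i] = lexicon[tokens[i].lower()]
    return " ".join(tokens)


def accuracy(predict, examples):
    """Fraction of (text, label) pairs that `predict` labels correctly."""
    correct = sum(predict(text) == label for text, label in examples)
    return correct / len(examples)


def robustness_gap(predict, examples, lexicon, ratio):
    """Accuracy on the clean inputs minus accuracy on code-mixed inputs."""
    mixed = [(code_mix(text, lexicon, ratio), label) for text, label in examples]
    return accuracy(predict, examples) - accuracy(predict, mixed)
```

For example, `code_mix("saya makan nasi", TOY_LEXICON, ratio=1.0)` swaps all three words. A larger `robustness_gap` for one embedded language than for another would suggest the model is less robust to mixing with that language, which is the kind of comparison the abstract describes across English, Sundanese, Javanese, and Malay.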

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling