IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages
Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Genta Indra Winata, Pascale Fung, Ayu Purwarianti

TL;DR
This paper introduces IndoRobusta, a framework designed to evaluate and enhance the robustness of Indonesian NLP models against diverse code-mixed languages, addressing a gap in handling mixed local languages and English.
Contribution
The paper presents IndoRobusta, a novel framework for assessing and improving model robustness to code-mixed Indonesian with multiple embedded languages.
Findings
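Pre-training corpus bias affects robustness: models handle Indonesian-English code-mixing better than code-mixing with local languages.
Models are less robust to code-mixing with Sundanese, Javanese, and Malay, even though the Indonesian-English setting has higher language diversity.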
IndoRobusta provides insights into robustness challenges in multilingual Indonesian NLP.
Abstract
Significant progress has been made on Indonesian NLP. Nevertheless, exploration of the code-mixing phenomenon in Indonesian is limited, despite many languages being frequently mixed with Indonesian in daily conversation. In this work, we explore code-mixing in Indonesian with four embedded languages, i.e., English, Sundanese, Javanese, and Malay; and introduce IndoRobusta, a framework to evaluate and improve the code-mixing robustness. Our analysis shows that the pre-training corpus bias affects the model's ability to better handle Indonesian-English code-mixing when compared to other local languages, despite having higher language diversity.
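As a rough illustration of what evaluating code-mixing robustness can involve, the sketch below swaps some Indonesian words for embedded-language words and measures how much a classifier's accuracy drops. It is a minimal sketch under stated assumptions, not the IndoRobusta implementation: the toy lexicon, the `code_mix` and `robustness_gap` helpers, and the `ratio` parameter are all hypothetical names introduced here for illustration.

```python
import random

# Hypothetical toy lexicon (Indonesian -> English); not taken from the paper.
TOY_LEXICON = {
    "saya": "I",
    "suka": "like",
    "makanan": "food",
    "ini": "this",
}


def code_mix(sentence, lexicon, ratio, seed=0):
    """Replace roughly `ratio` of the translatable words with their lexicon entry."""
    rng = random.Random(seed)
    words = sentence.split()
    # Only words that have a lexicon entry can be swapped.
    candidates = [i for i, w in enumerate(words) if w.lower() in lexicon]
    n_swap = round(ratio * len(candidates))
    for i in rng.sample(candidates, n_swap):
        words[i] = lexicon[words[i].lower()]
    return " ".join(words)


def accuracy(predict, sentences, labels):
    """Fraction of sentences for which `predict` returns the gold label."""
    correct = sum(predict(s) == y for s, y in zip(sentences, labels))
    return correct / len(sentences)


def robustness_gap(predict, sentences, labels, lexicon, ratio):
    """Accuracy on clean input minus accuracy on code-mixed input."""
    mixed = [code_mix(s, lexicon, ratio) for s in sentences]
    return accuracy(predict, sentences, labels) - accuracy(predict, mixed, labels)
```

Here `predict` stands for any function that maps a sentence to a label. A larger gap means the model is less robust to that embedded language, so comparing gaps across per-language lexicons would give one simple way to contrast English with Sundanese, Javanese, or Malay.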
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
