Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages
Zheng-Xin Yong, Ruochen Zhang, Jessica Zosa Forde, Skyler Wang, Arjun Subramonian, Holy Lovenia, Samuel Cahyawijaya, Genta Indra Winata, Lintang Sutawika, Jan Christian Blaise Cruz, Yin Lin Tan, Long Phan, Rowena Garcia, Thamar Solorio, Alham Fikri Aji

TL;DR
This paper investigates the ability of current multilingual large language models to generate code-mixed texts for Southeast Asian languages, revealing significant limitations and variability in their performance across different models and language pairs.
Contribution
It provides a systematic evaluation of multilingual LLMs' capacity to produce code-mixed data for SEA languages, highlighting their inadequacies and the need for human oversight.
Findings
Publicly available instruction-tuned models such as BLOOMZ and Flan-T5-XXL fail to produce texts containing phrases or clauses from different languages.
ChatGPT's performance varies greatly depending on prompts and language pairs.
Existing LLMs are unreliable for code-mixed data generation without human checks.
Abstract
While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems in generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero-shot manner to generate code-mixed data for seven languages in South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese, Tamil, and Singlish. We find that publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language…
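To make the zero-shot setup concrete, here is a minimal sketch of how prompts for such a study might be assembled. The language list comes from the abstract; the template wording, the topics, the choice to pair each language with English, and the separate handling of Singlish are all assumptions for illustration and are not taken from the paper.

```python
from itertools import product

# Languages named in the abstract. Pairing each with English is an assumption.
PAIRED_WITH_ENGLISH = ["Indonesian", "Malay", "Chinese", "Tagalog", "Vietnamese", "Tamil"]

# Hypothetical prompt templates and topics (not the paper's actual wording).
MIXED_TEMPLATES = [
    "Write a code-mixed sentence in {lang} and English about {topic}.",
    "Imitate how a bilingual {lang} and English speaker would talk about {topic}.",
]
# Assumption: Singlish is prompted on its own rather than paired with English.
SINGLISH_TEMPLATE = "Write a sentence in Singlish about {topic}."
TOPICS = ["food", "weather"]


def build_prompts(topics=TOPICS):
    """Return one zero-shot prompt per (language, template, topic) combination."""
    prompts = []
    for lang, template, topic in product(PAIRED_WITH_ENGLISH, MIXED_TEMPLATES, topics):
        prompts.append({"language": lang, "prompt": template.format(lang=lang, topic=topic)})
    for topic in topics:
        prompts.append({"language": "Singlish", "prompt": SINGLISH_TEMPLATE.format(topic=topic)})
    return prompts


if __name__ == "__main__":
    # Print a few prompts; sending them to a model is left out of this sketch.
    for item in build_prompts()[:3]:
        print(item["language"], "->", item["prompt"])
```

Each prompt would then be sent to a model with no in-context examples, and the outputs checked by human annotators, since the findings above indicate that model outputs cannot be trusted without such checks.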
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
Methods: BLOOMZ
