Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

Zheng-Xin Yong; Ruochen Zhang; Jessica Zosa Forde; Skyler Wang; Arjun Subramonian; Holy Lovenia; Samuel Cahyawijaya; Genta Indra Winata; Lintang Sutawika; Jan Christian Blaise Cruz; Yin Lin Tan; Long Phan; Rowena Garcia; Thamar Solorio; Alham Fikri Aji

arXiv:2303.13592 · cs.CL · September 13, 2023 · 6 cites

TL;DR

This paper investigates the ability of current multilingual large language models to generate code-mixed texts for Southeast Asian languages, revealing significant limitations and variability in their performance across different models and language pairs.

Contribution

It provides a systematic evaluation of multilingual LLMs' capacity to produce code-mixed data for SEA languages, highlighting their inadequacies and the need for human oversight.

Findings

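01. Publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL fail to produce texts containing phrases or clauses from different languages.

02. ChatGPT generates code-mixed text inconsistently: its performance varies with the prompt template and the language pair.

03. Existing LLMs are unreliable for code-mixed data generation without human checks.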

Abstract

While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems in generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero-shot manner to generate code-mixed data for seven languages in South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese, Tamil, and Singlish. We find that publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language…
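The zero-shot setup described in the abstract can be pictured as crossing a set of prompt templates with the target languages. The sketch below is a minimal illustration only: the template wording, the topics, the pairing of each language with English, and the helper name `build_prompts` are assumptions made for this example and are not taken from the paper. It builds the prompts and does not call any model.

```python
from itertools import product

# The seven languages named in the abstract.
SEA_LANGUAGES = [
    "Indonesian", "Malay", "Chinese", "Tagalog",
    "Vietnamese", "Tamil", "Singlish",
]

# Hypothetical templates and topics, invented for illustration.
# Pairing every language with English is also an assumption; Singlish
# is treated like the others here purely to keep the sketch simple.
TEMPLATES = [
    "Write a sentence that mixes English and {language} about {topic}.",
    "Imagine a bilingual speaker of English and {language}. "
    "Write something they might say about {topic}, using both languages.",
]
TOPICS = ["food", "weather"]


def build_prompts(languages, templates, topics):
    """Return one zero-shot prompt per (language, template, topic) combination."""
    prompts = []
    for language, template, topic in product(languages, templates, topics):
        prompts.append({
            "language": language,
            "topic": topic,
            "prompt": template.format(language=language, topic=topic),
        })
    return prompts


prompts = build_prompts(SEA_LANGUAGES, TEMPLATES, TOPICS)
print(len(prompts))          # 7 languages x 2 templates x 2 topics = 28
print(prompts[0]["prompt"])
```

Keeping the language and topic alongside each prompt makes it straightforward to group model outputs by template and language later, which is the axis along which the abstract reports ChatGPT's performance varying.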

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling

Methods: BLOOMZ