Leveraging Large Language Models for Code-Mixed Data Augmentation in Sentiment Analysis

Linda Zeng

arXiv:2411.00691 · cs.CL · October 28, 2025

TL;DR

This paper shows that large language models can generate synthetic code-mixed data that improves sentiment analysis, particularly when the baseline is low, making them a cost-effective source of data augmentation.

Contribution

The paper proposes using a large language model to generate synthetic code-mixed data, which is then used to improve task-specific models for code-mixed sentiment analysis.

Findings

01

In Spanish-English, synthetic data improved the F1 score by 9.32%, outperforming previous augmentation techniques.

02

In Malayalam-English, synthetic data helped only when the baseline was low; with strong natural data, additional synthetic data offered little benefit.

03

Human evaluation confirmed the naturalness and cost-effectiveness of generated code-mixed sentences.

Abstract

Code-mixing (CM), where speakers blend languages within a single expression, is prevalent in multilingual societies but poses challenges for natural language processing due to its complexity and limited data. We propose using a large language model to generate synthetic CM data, which is then used to enhance the performance of task-specific models for CM sentiment analysis. Our results show that in Spanish-English, synthetic data improved the F1 score by 9.32%, outperforming previous augmentation techniques. However, in Malayalam-English, synthetic data only helped when the baseline was low; with strong natural data, additional synthetic data offered little benefit. Human evaluation confirmed that this approach is a simple, cost-effective way to generate natural-sounding CM sentences, particularly beneficial for low baselines. Our findings suggest that few-shot prompting of large…
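The few-shot prompting idea described in the abstract can be illustrated with a small sketch. This is not the paper's actual prompt or pipeline: the function name `build_few_shot_prompt`, the prompt wording, and the example sentences are all hypothetical, chosen only to show how a handful of labeled code-mixed sentences might be assembled into a prompt that asks a model for more sentences with a given sentiment.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    label: str,
    language_pair: str = "Spanish-English",
    n_new: int = 5,
) -> str:
    """Assemble a few-shot prompt for code-mixed sentence generation.

    `examples` is a list of (sentence, sentiment_label) pairs taken from
    natural data. The wording below is an illustrative assumption, not the
    prompt used in the paper.
    """
    header = (
        f"Write {n_new} new {language_pair} code-mixed sentences "
        f"with {label} sentiment, in the style of the examples."
    )
    # Each labeled example becomes one "shot" in the prompt.
    shots = [f"Sentence: {text}\nSentiment: {lab}" for text, lab in examples]
    # End with the requested label so the model continues with a sentence.
    footer = f"Sentiment: {label}\nSentence:"
    return "\n\n".join([header, *shots, footer])


# Made-up example sentences, for illustration only (not from the dataset).
examples = [
    ("I love this película", "positive"),
    ("Qué mal service", "negative"),
]
prompt = build_few_shot_prompt(examples, "positive", n_new=3)
print(prompt)
```

The resulting string would be sent to whichever large language model is used, and the generated sentences, paired with the requested label, would be added to the natural training data for the task-specific sentiment model.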

Code & Models

Repositories

lindazeng979/llm-cmsa
Official

Videos

Leveraging Large Language Models for Code-Mixed Data Augmentation in Sentiment Analysis · underline

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Sentiment Analysis and Opinion Mining