Conditioning LLMs to Generate Code-Switched Text
Maite Heredia, Gorka Labaka, Jeremy Barnes, Aitor Soroa

TL;DR
This paper explores how fine-tuning large language models with back-translated code-switched data improves their ability to generate fluent English-Spanish code-switched text, highlighting the importance of human-aligned evaluation methods.
Contribution
The paper introduces a novel fine-tuning approach based on back-translated code-switched (CS) data, and analyses both model performance and the evaluation metrics used to measure it.
Findings
Fine-tuning enhances fluency in CS text generation.
Traditional reference-based metrics do not align with human judgments.
LLM-based judgment correlates better with human preferences.
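One common way to check whether an automatic metric agrees with human preferences is a rank correlation between the two sets of scores. The sketch below computes Spearman's rank correlation in plain Python. It is only an illustration: this summary does not say which statistic the authors used, and the function names and toy scores are assumptions made up for the example, not results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy, invented scores for four system outputs (NOT data from the paper).
human_scores = [1, 2, 3, 4]
metric_scores = [0.40, 0.10, 0.30, 0.20]
rho = spearman(human_scores, metric_scores)
```

A value of `rho` near 1 would mean the metric ranks outputs the way human annotators do; values near 0 or below would indicate the kind of misalignment described in the findings.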
Abstract
Code-switching (CS) is still a critical challenge in Natural Language Processing (NLP), due to the limited availability of large-scale, diverse CS datasets for robust training and evaluation. Despite recent advances, the capabilities and limitations of LLMs in handling CS are still not fully understood. In this work, we investigate the extent to which LLMs can be used in a framework for CS text generation, focusing on the English-Spanish language pair. Our proposed methodology consists of back-translating natural CS sentences into monolingual English, and using the resulting parallel corpus to fine-tune LLMs to turn monolingual sentences into CS. We thoroughly analyse the models' performance through a study on human preferences, a qualitative error analysis, an evaluation with popular reference-based metrics and LLM-based judgment. Results show that fine-tuning can be a key step to…
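The data-construction step described in the abstract (back-translate natural CS sentences into English, then train on English-to-CS pairs) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `build_parallel_corpus`, `PROMPT_TEMPLATE`, the prompt wording, the record fields, and the toy translator are all hypothetical, and the example sentence is invented. The actual back-translation system and fine-tuning format used by the authors are not specified in this summary.

```python
import json
from typing import Callable, Dict, Iterable, List

# Hypothetical prompt wording; the authors' actual prompt is not given here.
PROMPT_TEMPLATE = (
    "Rewrite the following English sentence as English-Spanish "
    "code-switched text:\n{source}"
)


def build_parallel_corpus(
    cs_sentences: Iterable[str],
    back_translate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each natural CS sentence with its monolingual English version.

    The English back-translation becomes the model input and the original
    CS sentence becomes the target, so a fine-tuned model learns to turn
    monolingual English into CS.
    """
    records = []
    for cs in cs_sentences:
        english = back_translate(cs)
        records.append(
            {"prompt": PROMPT_TEMPLATE.format(source=english), "completion": cs}
        )
    return records


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialise records as JSON Lines, keeping non-ASCII characters intact."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Toy stand-in for a real machine-translation system (assumption).
def toy_back_translate(cs_sentence: str) -> str:
    lookup = {
        "I need to go to la tienda before it closes.":
            "I need to go to the store before it closes.",
    }
    return lookup[cs_sentence]


corpus = build_parallel_corpus(
    ["I need to go to la tienda before it closes."], toy_back_translate
)
jsonl_text = to_jsonl(corpus)
```

In practice `toy_back_translate` would be replaced by a call to whatever translation model is available, and the resulting JSON Lines text would be fed to a fine-tuning pipeline of one's choice.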
Taxonomy
Topics: Natural Language Processing Techniques
