CS-Sum: A Benchmark for Code-Switching Dialogue Summarization and the Limits of Large Language Models
Sathya Krishnan Suresh, Tanmay Surana, Lim Zhi Hao, Eng Siong Chng

TL;DR
This paper introduces CS-Sum, a benchmark for evaluating how well large language models understand and summarize code-switched dialogues across multiple language pairs, revealing limitations and common errors in current models.
Contribution
The paper presents the first benchmark for code-switching dialogue summarization, including a dataset and analysis of LLM performance across different approaches and language pairs.
Findings
LLMs often make subtle errors that change dialogue meaning
Error rates vary across language pairs and models
Current automated metrics may overestimate LLM performance on CS tasks
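The third finding can be illustrated with a minimal sketch. It assumes a lexical-overlap metric in the style of ROUGE-1 (unigram F1); the summary above does not say which automated metrics the paper uses, and the sentences below are hypothetical examples written for illustration, not items from CS-Sum. The helper `unigram_f1` is likewise a name made up for this sketch.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 over lowercased whitespace tokens (illustrative only)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference summary and two hypothetical system outputs.
reference = "amy will lend ben her laptop on friday"
faithful = "amy will lend her laptop to ben on friday"  # same meaning, reworded
swapped = "ben will lend amy her laptop on friday"      # speaker roles reversed

faithful_score = unigram_f1(faithful, reference)
swapped_score = unigram_f1(swapped, reference)

print(f"faithful: {faithful_score:.3f}")  # 0.941
print(f"swapped:  {swapped_score:.3f}")   # 1.000
```

In this toy case the summary that reverses who lends the laptop gets a perfect score, while the faithful paraphrase scores lower, because a bag-of-words overlap ignores word order and speaker roles.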
Abstract
Code-switching (CS) poses a significant challenge for Large Language Models (LLMs), yet how well LLMs comprehend it remains underexplored. We introduce CS-Sum to evaluate LLMs' comprehension of CS through CS-dialogue-to-English summarization. CS-Sum is the first benchmark for CS dialogue summarization across Mandarin-English (EN-ZH), Tamil-English (EN-TA), and Malay-English (EN-MS), with 900-1300 human-annotated dialogues per language pair. Evaluating ten LLMs, including open and closed-source models, we analyze performance across few-shot, translate-summarize, and fine-tuning (LoRA, QLoRA on synthetic data) approaches. Our findings show that although scores on automated metrics are high, LLMs make subtle mistakes that alter the complete meaning of the dialogue. To this end, we introduce the three most common types of errors that LLMs make when handling CS input. Error rates…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
