CS-Sum: A Benchmark for Code-Switching Dialogue Summarization and the Limits of Large Language Models

Sathya Krishnan Suresh; Tanmay Surana; Lim Zhi Hao; Eng Siong Chng

arXiv:2505.13559 · cs.CL · May 21, 2025

TL;DR

This paper introduces CS-Sum, a benchmark for evaluating how well large language models understand and summarize code-switched dialogues across multiple language pairs, revealing limitations and common errors in current models.

Contribution

The paper presents the first benchmark for code-switching dialogue summarization, including a dataset and analysis of LLM performance across different approaches and language pairs.

Findings

01. LLMs often make subtle errors that change dialogue meaning
02. Error rates vary across language pairs and models
03. Current automated metrics may overestimate LLM performance on CS tasks
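
To make the third finding concrete, the sketch below shows how a purely lexical overlap score can stay high when a summary reverses who did what. It is a minimal, ROUGE-1-style unigram F1 written for illustration only; this listing does not say which automated metrics the paper uses, and the example sentences are hypothetical, not drawn from CS-Sum.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between two lowercased, whitespace-tokenized strings.

    A simplified ROUGE-1-style score (no stemming or stopword handling).
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts at most as often as it occurs in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, invented for illustration (not from CS-Sum).
reference = "amy will lend ben her laptop on friday"
role_swapped = "ben will lend amy her laptop on friday"  # lender and borrower reversed
unrelated = "they talked about the weather"

swapped_score = rouge1_f1(reference, role_swapped)
unrelated_score = rouge1_f1(reference, unrelated)
print(f"role-swapped: {swapped_score:.2f}, unrelated: {unrelated_score:.2f}")
# prints "role-swapped: 1.00, unrelated: 0.00"
```

The role-swapped summary contains exactly the same words as the reference, so a bag-of-words metric cannot tell them apart even though the meaning has changed. This is one plausible way an automated score could overstate summary quality.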

Abstract

Code-switching (CS) poses a significant challenge for Large Language Models (LLMs), yet how well LLMs comprehend it remains underexplored. We introduce CS-Sum to evaluate LLMs' comprehension of CS through the summarization of CS dialogues into English. CS-Sum is the first benchmark for CS dialogue summarization across Mandarin-English (EN-ZH), Tamil-English (EN-TA), and Malay-English (EN-MS), with 900-1300 human-annotated dialogues per language pair. Evaluating ten LLMs, including open and closed-source models, we analyze performance across few-shot, translate-summarize, and fine-tuning (LoRA, QLoRA on synthetic data) approaches. Our findings show that although scores on automated metrics are high, LLMs make subtle mistakes that alter the complete meaning of the dialogue. To this end, we introduce the three most common types of errors that LLMs make when handling CS input. Error rates…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications