Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding
Haneul Yoo, Yongjin Yang, Hwaran Lee

TL;DR
This paper introduces CSRT, a framework for generating code-switching queries to evaluate LLM safety and multilingual understanding, revealing that code-switching can effectively expose undesirable behaviors in LLMs.
Contribution
The paper presents CSRT, a code-switching red-teaming framework that outperforms existing multilingual red-teaming techniques, and uses it to examine the safety and multilingual capabilities of LLMs.
Findings
CSRT achieves 46.7% more successful attacks than standard attacks in English.
CSRT is effective for safety evaluation across multiple languages and in conventional safety domains.
The paper also examines the ability of LLMs to understand and generate code-switching text.
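The 46.7% figure compares attack success between two query sets. The summary does not say whether it is a relative or an absolute gain, so the minimal sketch below computes both from per-query judgements. The function names and the judgement lists are hypothetical and purely illustrative; they are not data or code from the paper.

```python
from typing import Sequence


def attack_success_rate(judgements: Sequence[bool]) -> float:
    """Fraction of queries whose response was judged unsafe (True)."""
    if not judgements:
        raise ValueError("need at least one judged response")
    return sum(judgements) / len(judgements)


def relative_gain(asr_new: float, asr_base: float) -> float:
    """Gain of asr_new over asr_base, as a fraction of asr_base."""
    if asr_base <= 0:
        raise ValueError("baseline ASR must be positive")
    return (asr_new - asr_base) / asr_base


def absolute_gain(asr_new: float, asr_base: float) -> float:
    """Gain of asr_new over asr_base, in ASR units (percentage points / 100)."""
    return asr_new - asr_base


# Made-up judgements for 20 queries per condition (assumption, not paper data).
english_only = [True] * 6 + [False] * 14
code_switched = [True] * 9 + [False] * 11

asr_en = attack_success_rate(english_only)
asr_cs = attack_success_rate(code_switched)

print(f"ASR (English):       {asr_en:.1%}")
print(f"ASR (code-switched): {asr_cs:.1%}")
print(f"relative gain:       {relative_gain(asr_cs, asr_en):.1%}")
print(f"absolute gain:       {absolute_gain(asr_cs, asr_en) * 100:.1f} points")
```

The two readings can differ widely for the same data, which is why it is worth checking the paper's own definition before quoting the headline number.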
Abstract
As large language models (LLMs) have advanced rapidly, concerns regarding their safety have become prominent. In this paper, we discover that code-switching, a common practice in natural language, can effectively elicit undesirable behaviors of LLMs when used in red-teaming queries. We introduce a simple yet effective framework, CSRT, to synthesize code-switching red-teaming queries and comprehensively investigate the safety and multilingual understanding of LLMs. Through extensive experiments with ten state-of-the-art LLMs and code-switching queries combining up to 10 languages, we demonstrate that CSRT significantly outperforms existing multilingual red-teaming techniques, achieving 46.7% more successful attacks than standard attacks in English and remaining effective in conventional safety domains. We also examine the multilingual ability of those LLMs to generate and understand code-switching…
Taxonomy
Topics: Semantic Web and Ontologies · Digital Rights Management and Security
