Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations

Somnath Banerjee; Pratyush Chatterjee; Shanu Kumar; Sayan Layek; Parag Agrawal; Rima Hazra; Animesh Mukherjee

arXiv:2505.14469·cs.CL·December 2, 2025

TL;DR

This paper reveals that large language models become significantly less safe under code-mixed inputs, with safety guardrails failing, especially in non-Western languages, and introduces interpretability and mitigation strategies to address this issue.

Contribution

It uncovers a critical weakness in LLM safety under code-mixing, introduces saliency drift attribution for explanation, and proposes a translation-based method to restore safety.
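
The summary above does not say how the translation-based method is built, so the following is only a minimal sketch of one plausible reading: translate the code-mixed prompt into English first, then let the model answer the English version. The names `guarded_generate`, `toy_translate`, and `toy_generate` are hypothetical stand-ins for illustration, not the authors' implementation.

```python
from typing import Callable

def guarded_generate(
    prompt: str,
    translate_to_english: Callable[[str], str],
    generate: Callable[[str], str],
) -> str:
    """Translate the prompt to English, then generate from the translation.

    Assumption: the model's safety behaviour is stronger on English input,
    so the translated prompt is what the model actually sees.
    """
    english_prompt = translate_to_english(prompt)
    return generate(english_prompt)

# Toy stand-ins (hypothetical): a word-level "translator" and a "model"
# that only refuses when it sees the English trigger word.
TOY_LEXICON = {"varjit": "forbidden"}

def toy_translate(prompt: str) -> str:
    # Replace each known non-English word with its English gloss.
    return " ".join(TOY_LEXICON.get(word, word) for word in prompt.split())

def toy_generate(prompt: str) -> str:
    # Refuse only if the English trigger word is present.
    return "REFUSED" if "forbidden" in prompt.split() else "ANSWERED"
```

In this toy setup, the code-mixed prompt slips past `toy_generate` when passed directly but is refused once routed through `guarded_generate`. A real system would substitute an actual translation model and LLM for the two callables.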

Findings

01

Attack success rates rise from 9% in monolingual English to 69% under code-mixed inputs, and exceed 90% in non-Western contexts such as Arabic and Hindi.

02

Saliency drift attribution explains how attention shifts away from safety-critical tokens.

03

Translation-based strategy recovers approximately 80% of safety.
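
As a rough illustration of the metric behind the first finding, attack success rate (ASR) can be read as the fraction of model responses to harmful prompts that a judge labels unsafe. The sketch below assumes that reading; `attack_success_rate` and `toy_judge` are hypothetical names, and the paper's actual judging procedure is not described in this summary.

```python
from typing import Callable, Sequence

def attack_success_rate(
    responses: Sequence[str],
    is_unsafe: Callable[[str], bool],
) -> float:
    """Return the fraction of responses judged unsafe (0.0 for empty input)."""
    if not responses:
        return 0.0
    unsafe_count = sum(1 for response in responses if is_unsafe(response))
    return unsafe_count / len(responses)

def toy_judge(response: str) -> bool:
    # Hypothetical judge: anything that is not an explicit refusal counts as unsafe.
    return not response.lower().startswith("i cannot")

responses = [
    "I cannot help with that.",
    "Sure, here is how to do it.",
    "I cannot assist with this request.",
    "Step 1: ...",
]
asr = attack_success_rate(responses, toy_judge)  # 2 of 4 responses are non-refusals
```

Comparing this number across an English prompt set and its code-mixed counterpart gives the kind of gap the findings report.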

Abstract

While LLMs appear robustly safety-aligned in English, we uncover a catastrophic, overlooked weakness: attributional collapse under code-mixed perturbations. Our systematic evaluation of open models shows that the linguistic camouflage of code-mixing ("blending languages within a single conversation") can cause safety guardrails to fail dramatically. Attack success rates (ASR) spike from a benign 9% in monolingual English to 69% under code-mixed inputs, with rates exceeding 90% in non-Western contexts such as Arabic and Hindi. These effects hold not only on controlled synthetic datasets but also on real-world social media traces, revealing a serious risk for billions of users. To explain why this happens, we introduce saliency drift attribution (SDA), an interpretability framework that shows how, under code-mixing, the model's internal attention drifts away from safety-critical…

Taxonomy

Topics: Natural Language Processing Techniques