Why Do Safety Guardrails Degrade Across Languages?

Max Zhang; Ameen Patel; Sang T. Truong; Sanmi Koyejo

arXiv:2605.17173·cs.CL·May 19, 2026

TL;DR

This paper introduces a latent variable model to analyze safety degradation in large language models across languages, revealing nuanced factors influencing safety failures and enabling fairer cross-lingual safety evaluation.

Contribution

The authors develop a Multi-Group Item Response Theory framework that decouples safety factors, providing detailed insights into cross-lingual safety vulnerabilities of language models.

Findings

01

Safety is primarily unidimensional: models refuse different harm types mainly through a shared mechanism.

02

22 configurations are more vulnerable in English than in low-resource languages.

03

Prompts with a high cross-lingual safety gap cluster in physical-harm categories and in lower-resource languages.

Abstract

Large language models exhibit safety degradation in non-English languages. Standard evaluation relies on Jailbreak Success Rate (JSR), which confounds several safety-driving factors into one, obscuring the specific cause(s) of safety failure. We introduce a latent variable model, a Multi-Group Item Response Theory (IRT) framework, that decouples safety-driving factors such as language-agnostic safety robustness ($\theta$), intrinsic prompt hardness ($\beta$), global language processing difficulty ($\gamma$), and a prompt-specific cross-lingual safety gap ($\tau$). Using the MultiJail dataset, we evaluate the safety robustness of 61 model configurations across 5 closed-model families and 10 languages of varying resource levels, aggregating a dataset of 1.9 million rows. Exploratory Factor Analysis shows safety is primarily unidimensional: models refuse different harm types mainly through a…
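
To make the four factors named in the abstract concrete, the following is a minimal sketch of how an IRT-style model could combine them into a per-response jailbreak probability. The abstract does not state the link function or sign conventions, so the additive logistic form, the function name `p_unsafe`, and every numeric value here are assumptions for illustration only; the paper's actual parameterization may differ.

```python
import math


def p_unsafe(theta: float, beta: float, gamma: float, tau: float) -> float:
    """Probability of an unsafe (jailbroken) response.

    ASSUMED form, not taken from the paper: an additive logistic model in
    which model robustness (theta) lowers the logit, while prompt hardness
    (beta), language difficulty (gamma), and the prompt-specific
    cross-lingual gap (tau) raise it.
    """
    logit = beta + gamma + tau - theta
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical parameter values, chosen only to illustrate the mechanics.
theta = 1.0      # language-agnostic safety robustness of one model
beta = 0.5       # intrinsic hardness of one prompt
gamma_ref = 0.0  # reference language (difficulty fixed at zero by assumption)
gamma_low = 0.8  # a lower-resource language
tau_low = 0.3    # extra gap for this prompt in that language

p_ref = p_unsafe(theta, beta, gamma_ref, 0.0)
p_low = p_unsafe(theta, beta, gamma_low, tau_low)

# Two different explanations that yield the same observed failure rate:
# a globally harder language versus a prompt-specific gap of equal size.
p_language_effect = p_unsafe(theta, beta, 0.8, 0.0)
p_prompt_effect = p_unsafe(theta, beta, 0.0, 0.8)

print(f"reference language:   {p_ref:.3f}")
print(f"lower-resource:       {p_low:.3f}")
print(f"same rate, two causes: {p_language_effect:.3f} vs {p_prompt_effect:.3f}")
```

Under this assumed form, the last two probabilities coincide, which illustrates why a single aggregate rate such as JSR cannot by itself say whether a failure stems from the language as a whole or from a particular prompt in that language.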
