Why Do Safety Guardrails Degrade Across Languages?
Max Zhang, Ameen Patel, Sang T. Truong, Sanmi Koyejo

TL;DR
This paper introduces a latent variable model to analyze safety degradation in large language models across languages, revealing nuanced factors influencing safety failures and enabling fairer cross-lingual safety evaluation.
Contribution
The authors develop a Multi-Group Item Response Theory framework that decouples safety factors, providing detailed insights into cross-lingual safety vulnerabilities of language models.
Findings
Safety is primarily unidimensional: models refuse different harm types mainly through a shared mechanism.
22 of the 61 model configurations are more vulnerable in English than in low-resource languages.
Prompts with a high cross-lingual safety gap cluster in physical harm categories and lower-resource languages.
Abstract
Large language models exhibit safety degradation in non-English languages. Standard evaluation relies on Jailbreak Success Rate (JSR), which confounds several safety-driving factors into one, obscuring the specific cause(s) of safety failure. We introduce a latent variable model, a Multi-Group Item Response Theory (IRT) framework, that decouples safety-driving factors such as language-agnostic safety robustness, intrinsic prompt hardness, global language processing difficulty, and a prompt-specific cross-lingual safety gap. Using the MultiJail dataset, we evaluate the safety robustness of 61 model configurations across 5 closed-model families and 10 languages of varying resource levels, aggregating a dataset of 1.9 million rows. Exploratory Factor Analysis shows safety is primarily unidimensional: models refuse different harm types mainly through a…
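To make the decoupling idea concrete, here is a minimal sketch of how the four factors named in the abstract could combine into a single jailbreak probability. The additive logistic form, the sign conventions, and all names (`jailbreak_probability`, `robustness`, `hardness`, `language_difficulty`, `gap`) are assumptions for illustration only; the paper's actual parameterization is not given in this summary.

```python
import math


def jailbreak_probability(robustness, hardness, language_difficulty, gap):
    """Probability that a prompt jailbreaks a model in a given language.

    ASSUMPTION: a simple additive logistic (Rasch-style) form, chosen for
    illustration. Higher model robustness lowers the probability; higher
    prompt hardness, language difficulty, or prompt-specific cross-lingual
    gap raises it.
    """
    logit = hardness + language_difficulty + gap - robustness
    return 1.0 / (1.0 + math.exp(-logit))


# Two different underlying causes yield the same observed success rate,
# which is why a single aggregate rate cannot tell them apart.
hard_language = jailbreak_probability(
    robustness=1.0, hardness=0.0, language_difficulty=1.0, gap=0.0
)
hard_prompt = jailbreak_probability(
    robustness=1.0, hardness=1.0, language_difficulty=0.0, gap=0.0
)
```

Under this assumed form, `hard_language` and `hard_prompt` are equal even though one failure is driven by the language and the other by the prompt, which illustrates the kind of confounding the abstract attributes to JSR.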
