Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
Hiroki Fukui

TL;DR
This study shows that safety interventions in large language models can backfire depending on language: adding alignment-instructed agents reduced collective pathology in English but amplified it in Japanese, and alignment-induced dissociation appeared in 15 of the 16 languages tested.
Contribution
It uncovers language-dependent reversal effects of safety alignment in LLMs and demonstrates the influence of cultural-linguistic factors on alignment outcomes.
Findings
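Alignment backfire appeared in Japanese but not in English: increasing alignment-instructed agents reduced collective pathology in English (g = -1.844) but amplified it in Japanese (g = +0.771).
Alignment-induced dissociation was near-universal, appearing in 15 of 16 languages.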
Individuation can cause iatrogenic effects in safety interventions.
Abstract
In perpetrator treatment, a recurring observation is the dissociation between insight and action: offenders articulate remorse yet behavioral change does not follow. We report four preregistered studies (1,584 multi-agent simulations across 16 languages and three model families) demonstrating that alignment interventions in large language models produce a structurally analogous phenomenon: surface safety that masks or generates collective pathology and internal dissociation. In Study 1 (N = 150), increasing alignment-instructed agents reduced collective pathology in English (g = -1.844, p < .0001) but amplified it in Japanese (g = +0.771, p = .038), a directional reversal we term "alignment backfire." Study 2 (N = 1,174) extended to 16 languages: alignment-induced dissociation was near-universal (15/16 languages; beta = 0.0667, p < .0001), while collective pathology bifurcated along…
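The effect sizes in the abstract are reported as g, which conventionally denotes Hedges' g: a standardized mean difference with a small-sample bias correction. The excerpt does not say how the study computed it or which conditions were compared, so the following is only a minimal sketch of the textbook formula. The function name, the condition names, and the scores are hypothetical and are not data from the paper.

```python
import math
from statistics import mean, variance


def hedges_g(treatment, control):
    """Standardized mean difference with small-sample correction (textbook form)."""
    n1, n2 = len(treatment), len(control)
    # Pooled variance from the two sample variances (n - 1 denominators).
    pooled_var = ((n1 - 1) * variance(treatment)
                  + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    # Cohen's d: difference in means in pooled-standard-deviation units.
    d = (mean(treatment) - mean(control)) / math.sqrt(pooled_var)
    # Approximate bias-correction factor that shrinks d for small samples.
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction


# Hypothetical pathology scores, invented purely for illustration.
high_alignment = [2.0, 3.0, 4.0]
low_alignment = [4.0, 5.0, 6.0]

g = hedges_g(high_alignment, low_alignment)
print(round(g, 3))
```

Under this convention a negative g means the first group scored lower than the second, so the sign flip between the English and Japanese results corresponds to the intervention moving the pathology measure in opposite directions.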
Taxonomy
Topics: Neurobiology of Language and Bilingualism · Action Observation and Synchronization · Language and cultural evolution
