Language-Switching Triggers Take a Latent Detour Through Language Models

Francis Kulumba; Wissam Antoun; Théo Lasnier; Benoît Sagot; Djamé Seddah

arXiv:2605.18646 · cs.CL · May 19, 2026

TL;DR

This paper traces the internal circuit behind a language-switching backdoor in an 8B-parameter language model, showing how a three-word Latin trigger redirects output from English to French through a three-phase process.

Contribution

It identifies and decomposes the circuit responsible for a language-switching backdoor in a language model, highlighting the latent encoding and bottleneck structure involved.

Findings

01

The backdoor trigger operates through a three-phase circuit in the model.

02

The trigger signal propagates orthogonally to the natural language-identity direction.

03

Disrupting the bottleneck position effectively mitigates the backdoor.
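
The idea of disrupting a single position can be illustrated with a toy activation-patching sketch. Everything here is an assumption for illustration: the summary does not say how the authors corrupt the bottleneck, the helper `patch_position` is hypothetical, and the nested lists stand in for per-layer, per-position activation vectors of a real model.

```python
import copy

def patch_position(triggered, clean, layer, position):
    """Return a copy of `triggered` where one (layer, position) vector
    is overwritten with the matching vector from `clean`.

    Both inputs are nested lists indexed as [layer][position][dim].
    This is a hypothetical helper, not the paper's procedure.
    """
    patched = copy.deepcopy(triggered)
    patched[layer][position] = list(clean[layer][position])
    return patched

# Toy activations: 2 layers, 3 positions, 2 dimensions (made-up numbers).
clean = [
    [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2]],
    [[0.0, 0.1], [0.1, 0.2], [0.2, 0.3]],
]
triggered = [
    [[0.0, 0.0], [0.1, 0.1], [0.9, 0.8]],
    [[0.0, 0.1], [0.1, 0.2], [0.7, 0.6]],
]

# Overwrite only the last sequence position at layer 1.
patched = patch_position(triggered, clean, layer=1, position=-1)
```

In a real model the patched activations would be fed forward to see whether the output stays in English; that step is omitted because it depends on the model and tooling used.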

Abstract

Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through mid-layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely mitigates the trigger but also hinders the…
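
As a rough illustration of what "orthogonal to the language-identity direction" in phase (2) means, the sketch below computes a cosine similarity between two vectors and removes one direction from another. The vectors `lang_dir` and `trigger_signal` are made-up three-dimensional examples, not values from the paper, and the helper names are hypothetical; the abstract does not describe how the authors estimate either direction.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity; 0 means the vectors are orthogonal."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def remove_component(x, d):
    """Project x onto the subspace orthogonal to direction d."""
    scale = dot(x, d) / dot(d, d)
    return [xi - scale * di for xi, di in zip(x, d)]

# Hypothetical directions in a 3-dimensional toy residual space.
lang_dir = [1.0, 0.0, 0.0]        # stand-in for a language-identity direction
trigger_signal = [0.0, 2.0, 1.0]  # stand-in for the latent trigger signal

similarity = cosine(lang_dir, trigger_signal)
residual = remove_component(trigger_signal, lang_dir)
```

With these toy vectors the similarity is zero and removing the language direction leaves the signal unchanged, which is the sense in which an orthogonal signal would be invisible to a probe along that one direction.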
