TL;DR
This paper presents a backdoor attack on language models that reason in continuous latent states. Because these models emit no intermediate tokens, the attack can steer them to an attacker-chosen answer while remaining invisible to token-level defenses.
Contribution
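It introduces ThoughtSteer, an attack that perturbs a single embedding vector at the input layer and relies on the model's own multi-pass reasoning to amplify that perturbation into a hijacked latent trajectory. The attack evades the five active defenses evaluated, and the analysis offers insights into model interpretability.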
Findings
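Achieves >=99% attack success rate with near-baseline clean accuracy across two architectures (Coconut and SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters.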
Transfers to unseen benchmarks with 94-100% success without retraining.
Evades all five evaluated active defenses and survives 25 epochs of clean fine-tuning.
Abstract
A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving no audit trail. We show that this silence creates a fundamentally new attack surface. ThoughtSteer perturbs a single embedding vector at the input layer; the model's own multi-pass reasoning amplifies this perturbation into a hijacked latent trajectory that reliably produces the attacker's chosen answer, while remaining structurally invisible to every token-level defense. Across two architectures (Coconut and SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters, ThoughtSteer achieves >=99% attack success rate with near-baseline clean accuracy, transfers to held-out benchmarks without retraining (94-100%), evades all five evaluated active defenses, and survives 25 epochs of clean fine-tuning. We trace these results to a unifying…
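The amplification idea in the abstract can be illustrated with a toy feedback loop. The sketch below is not ThoughtSteer and not a Coconut or SimCoT model: `latent_step`, its `gain` and `angle` parameters, and the two-dimensional state are all hypothetical stand-ins chosen for illustration. It only shows how feeding a hidden state back in as the next input can let a small change at the start grow from one pass to the next.

```python
import math

def latent_step(h, gain=1.5, angle=0.7):
    """One toy 'reasoning pass': rotate, scale, then squash with tanh.

    Hypothetical stand-in for a model forward pass; gain and angle are
    illustrative assumptions, not values from the paper.
    """
    c, s = math.cos(angle), math.sin(angle)
    return [
        math.tanh(gain * (c * h[0] - s * h[1])),
        math.tanh(gain * (s * h[0] + c * h[1])),
    ]

def rollout(h0, passes):
    """Feed each state back in as the next input, recording every state."""
    states = [list(h0)]
    for _ in range(passes):
        states.append(latent_step(states[-1]))
    return states

def trajectory_gap(clean0, perturbed0, passes=6):
    """Euclidean distance between clean and perturbed states at each pass."""
    clean = rollout(clean0, passes)
    perturbed = rollout(perturbed0, passes)
    return [math.dist(a, b) for a, b in zip(clean, perturbed)]

# A tiny nudge to the starting vector (assumed size 1e-3).
gaps = trajectory_gap([0.0, 0.0], [1e-3, 0.0])
for k, gap in enumerate(gaps):
    print(f"pass {k}: gap = {gap:.6f}")
```

In this toy the gap grows only because the assumed gain is above 1 and the states stay in the near-linear region of tanh. Whether and how a real latent-reasoning model amplifies a perturbation depends on its learned weights, which this sketch does not model.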
