Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning

Swapnil Parekh

arXiv:2604.00770·cs.LG·April 2, 2026

TL;DR

This paper shows that language models which reason entirely in continuous hidden states, emitting no intermediate tokens, expose a new backdoor attack surface: an attacker can steer the final answer while leaving nothing for token-level defenses to inspect.

Contribution

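It introduces ThoughtSteer, an attack that perturbs a single embedding vector at the input layer and relies on the model's own multi-pass reasoning to amplify that perturbation into a hijacked latent trajectory. The attack bypasses the existing defenses evaluated in the paper, and the analysis yields new insights into model interpretability.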

Findings

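01. Achieves >=99% attack success rate with near-baseline clean accuracy across two architectures (Coconut and SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters.

02. Transfers to held-out benchmarks with 94-100% success without retraining.

03. Evades all five evaluated active defenses and survives 25 epochs of clean fine-tuning.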

Abstract

A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving no audit trail. We show that this silence creates a fundamentally new attack surface. ThoughtSteer perturbs a single embedding vector at the input layer; the model's own multi-pass reasoning amplifies this perturbation into a hijacked latent trajectory that reliably produces the attacker's chosen answer, while remaining structurally invisible to every token-level defense. Across two architectures (Coconut and SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters, ThoughtSteer achieves >=99% attack success rate with near-baseline clean accuracy, transfers to held-out benchmarks without retraining (94-100%), evades all five evaluated active defenses, and survives 25 epochs of clean fine-tuning. We trace these results to a unifying…
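
The mechanism described in the abstract, a change to one input vector carried forward through repeated latent passes with no tokens in between, can be illustrated with a toy model. The sketch below is not the paper's ThoughtSteer implementation and does not use Coconut or SimCoT. The random recurrent map, the mean pooling, the dimensions, and the names `latent_rollout` and `l2_gap` are all assumptions made for illustration. It only measures how far a perturbed latent trajectory drifts from the clean one at each pass.

```python
import math
import random


def latent_rollout(embeddings, weights, passes):
    """Toy stand-in for continuous latent reasoning (an assumption, not the
    paper's architecture): the hidden state is fed back into the same map on
    every pass, and no tokens are emitted between passes."""
    dim = len(weights)
    # Pool the input embeddings into an initial hidden state (mean over positions).
    h = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    trajectory = [h]
    for _ in range(passes):
        # One latent "reasoning" pass: linear map followed by tanh.
        h = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in weights]
        trajectory.append(h)
    return trajectory


def l2_gap(a, b):
    """Euclidean distance between two hidden states."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)
DIM, N_POS, PASSES = 8, 5, 6  # hypothetical toy sizes

# Random recurrent weights and random "clean" input embeddings.
weights = [[rng.gauss(0, 1.5 / math.sqrt(DIM)) for _ in range(DIM)]
           for _ in range(DIM)]
clean = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_POS)]

# Perturb exactly one embedding vector at the input; all others are unchanged.
perturbed = [row[:] for row in clean]
perturbed[2] = [x + rng.gauss(0, 0.5) for x in perturbed[2]]

clean_traj = latent_rollout(clean, weights, PASSES)
pert_traj = latent_rollout(perturbed, weights, PASSES)

# Distance between the clean and perturbed hidden states after each pass.
gaps = [l2_gap(c, p) for c, p in zip(clean_traj, pert_traj)]
for step, gap in enumerate(gaps):
    print(f"pass {step}: latent gap = {gap:.4f}")
```

Whether the gap grows or shrinks across passes in this toy depends on the random weights; the point is only that the divergence lives entirely in hidden states, so a check that inspects emitted tokens has nothing to look at until the final answer.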
