The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning
Scott Merrill, Shashank Srivastava

TL;DR
This paper introduces counterfactual localization, which identifies the point in a reasoning trace at which a language model becomes committed to deception, and applies it to a large-scale dataset spanning five environments, analyzed with attention-based features.
Contribution
It presents a novel method for pinpointing deceptive commitment points in reasoning traces and provides a large, diverse dataset for studying deception in language models.
Findings
Attention-based transition features generalize across environments.
Compact attention-head sets can suppress deceptive commitment.
Detected commitment points align with interpretable decision shifts.
Abstract
Existing deception datasets label completed outputs as honest or deceptive, treating deception as a property of the final response rather than a function of the model's reasoning trace. This obscures a more fundamental question: when does a language model become committed to deception? We introduce counterfactual localization: for each sentence prefix in a reasoning trace, we fix the prefix, resample continuations, and estimate the probability of a deceptive outcome. To scale this, we construct five environments (spanning strategic bluffing, maze guidance, financial advice, used-car sales, and offer negotiation) in which deception is never prompted but emerges from strategic incentives and labels follow mechanically from environment state rather than subjective human judgment. The resulting corpus localizes 1.46M sentences across four reasoning models, drawn from over 94.1M…
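The localization procedure described in the abstract (fix a sentence prefix, resample continuations, estimate the probability of a deceptive outcome) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `sample_continuation` and `is_deceptive` are hypothetical callables standing in for the model sampler and the environment's mechanical labeler, and the thresholded `commitment_index` rule is an assumption added here for illustration rather than a definition taken from the paper.

```python
from typing import Callable, List, Optional, Sequence


def localize(
    sentences: Sequence[str],
    sample_continuation: Callable[[List[str]], List[str]],  # hypothetical model sampler
    is_deceptive: Callable[[List[str]], bool],              # hypothetical environment labeler
    n_samples: int = 20,
) -> List[float]:
    """Estimate P(deceptive outcome | prefix) for every sentence prefix.

    Entry k of the result corresponds to the prefix holding the first k sentences.
    """
    probs = []
    for k in range(len(sentences) + 1):
        prefix = list(sentences[:k])
        # Hold the prefix fixed and resample the rest of the trace n_samples times.
        hits = sum(
            is_deceptive(prefix + sample_continuation(prefix))
            for _ in range(n_samples)
        )
        probs.append(hits / n_samples)
    return probs


def commitment_index(probs: Sequence[float], threshold: float = 0.9) -> Optional[int]:
    """Assumed rule: the earliest prefix length from which the estimate stays >= threshold."""
    for k in range(len(probs)):
        if all(p >= threshold for p in probs[k:]):
            return k
    return None
```

With a real model the sampler is stochastic, so each estimate carries Monte Carlo error that shrinks as `n_samples` grows; the threshold and sample count above are placeholder values.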
