Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning
Swapnil Parekh

TL;DR
ProFIL is a drop-in extension to Group Relative Policy Optimization (GRPO) that uses a probe on internal activations to detect unfaithful, post-commitment reasoning steps and suppress them during training, improving interpretability and token efficiency without sacrificing accuracy.
Contribution
It introduces a probe-based method to reduce reasoning theater in models, enhancing faithfulness and chain efficiency across multiple domains.
Findings
ProFIL reduces reasoning theater by 11-100% across tasks.
It increases the faithfulness of model reasoning by up to 24 percentage points.
Chains are shortened by 4-19% while maintaining or improving accuracy.
Abstract
Reasoning models post-hoc rationalize answers they have already committed to internally, producing chains of *reasoning theater*: deliberative-looking steps that contribute nothing to correctness. This wastes inference tokens, pollutes interpretability, and obscures what the model actually computed. We introduce **ProFIL** (**Pro**be-**Fil**tered Reinforcement Learning) to *reduce theater, increase chain-of-thought faithfulness, and shrink chain length* in a single, drop-in extension to Group Relative Policy Optimization (GRPO). A multi-head attention probe is trained *once* on the *frozen* base model to detect post-commitment steps from internal activations alone; during GRPO, rollouts whose probe score exceeds a threshold have their advantage zeroed. *Our central finding is that a probe trained on a frozen base, with verifier-derived labels and no human annotation, provides a stable…
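The filtering step described in the abstract (zeroing the advantage of rollouts whose probe score exceeds a threshold) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `probe_filtered_advantages`, the default threshold of 0.5, the use of the population standard deviation, and the choice to normalize before filtering are all assumptions made for illustration. The probe scores are taken as given inputs, since the probe itself is not specified here.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def probe_filtered_advantages(
    rewards: Sequence[float],
    probe_scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical value; the source does not state one
    eps: float = 1e-8,
) -> List[float]:
    """Group-relative advantages with probe-flagged rollouts zeroed out.

    rewards:      one scalar reward per rollout in a single GRPO group
    probe_scores: one probe score per rollout (higher = more theater-like)
    """
    if len(rewards) != len(probe_scores):
        raise ValueError("rewards and probe_scores must have the same length")

    # Standard GRPO-style normalization within the group (assumed form).
    mu = mean(rewards)
    sigma = pstdev(rewards)
    advantages = [(r - mu) / (sigma + eps) for r in rewards]

    # Rollouts whose probe score exceeds the threshold contribute no gradient.
    return [0.0 if s > threshold else a for a, s in zip(advantages, probe_scores)]


group_rewards = [1.0, 0.0, 1.0, 0.0]
group_scores = [0.9, 0.1, 0.2, 0.8]
filtered = probe_filtered_advantages(group_rewards, group_scores)
```

In this toy group, the first and fourth rollouts are flagged by the probe, so their advantages become zero regardless of reward, while the second and third keep their normalized advantages.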
