Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning

Swapnil Parekh

arXiv:2605.11467 · cs.LG · May 13, 2026

TL;DR

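ProFIL extends GRPO with a probe, trained once on the frozen base model, that scores rollouts for post-commitment reasoning from internal activations; rollouts scoring above a threshold have their advantage zeroed. The reported effect is less reasoning theater, more faithful chains of thought, and shorter chains, with accuracy maintained or improved.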

Contribution

The paper introduces a probe-based method for reducing reasoning theater (deliberative-looking steps that contribute nothing to correctness), improving faithfulness and shortening chains across multiple domains.

Findings

01. ProFIL reduces reasoning theater by 11-100% across tasks.
02. It increases the faithfulness of model reasoning by up to 24 percentage points.
03. Chains are shortened by 4-19% while maintaining or improving accuracy.

Abstract

Reasoning models post-hoc rationalize answers they have already committed to internally, producing chains of *reasoning theater*: deliberative-looking steps that contribute nothing to correctness. This wastes inference tokens, pollutes interpretability, and obscures what the model actually computed. We introduce **ProFIL** (**Pro**be-**Fil**tered Reinforcement Learning) to *reduce theater, increase chain-of-thought faithfulness, and shrink chain length* in a single, drop-in extension to Group Relative Policy Optimization (GRPO). A multi-head attention probe is trained *once* on the *frozen* base model to detect post-commitment steps from internal activations alone; during GRPO, rollouts whose probe score exceeds a threshold have their advantage zeroed. *Our central finding is that a probe trained on a frozen base, with verifier-derived labels and no human annotation, provides a stable…
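The filtering step described in the abstract can be illustrated with a minimal sketch. It assumes the common GRPO convention of standardizing rewards within a group of rollouts for one prompt, a single scalar probe score per rollout, and a threshold of 0.5; the function names, the threshold value, and the choice to compute the group baseline before filtering are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one prompt's group.

    Assumption: mean/std normalization as in the usual GRPO formulation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def probe_filtered_advantages(rewards, probe_scores, threshold=0.5):
    """Zero the advantage of rollouts whose probe score exceeds the threshold.

    Assumptions (not from the paper): one scalar probe score per rollout,
    threshold = 0.5, and the group baseline is computed over all rollouts
    before any are filtered.
    """
    if len(rewards) != len(probe_scores):
        raise ValueError("rewards and probe_scores must have the same length")
    advantages = grpo_advantages(rewards)
    return [
        0.0 if score > threshold else adv
        for adv, score in zip(advantages, probe_scores)
    ]


# Hypothetical group of four rollouts for one prompt.
rewards = [1.0, 1.0, 0.0, 0.0]        # verifier rewards
probe_scores = [0.9, 0.2, 0.1, 0.7]   # probe's estimate of post-commitment theater
filtered = probe_filtered_advantages(rewards, probe_scores)
```

In this sketch the first and last rollouts are flagged by the probe, so they contribute no gradient signal regardless of reward, while the unflagged rollouts keep their group-relative advantages.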
