Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness
Abhinaba Basu, Pavan Chakraborty

TL;DR
This paper introduces the SLRC metric and LC-CoSR training method to measure and reduce reasoning rigidity in language models, improving faithfulness and robustness.
Contribution
The paper proposes a new metric for reasoning rigidity (SLRC) and a training method with Lyapunov stability guarantees (LC-CoSR), and it evaluates 16 frontier models, revealing the impact of RL-based reasoning training on faithfulness.
Findings
OpenAI's o4-mini has the highest SLRC among evaluated models.
RL-based reasoning training influences faithfulness more than thinking tokens.
High-SLRC models are more susceptible to sycophancy, a finding that motivates the RIS metric.
Abstract
Language models increasingly show their work by writing step-by-step reasoning before answering. But are these steps genuinely used, or is the answer rigid, fixed before reasoning begins? We introduce the Step-Level Reasoning Capacity (SLRC) metric and prove it is a consistent causal estimator (Theorem 1). We propose LC-CoSR, a training method with Lyapunov stability guarantees that directly reduces rigidity. Evaluating 16 frontier models (o4-mini, GPT-5.4, Claude Opus, Grok-4, DeepSeek-R1, Gemini 2.5 Pro, and others) across six domains at N=133-500, we find reasoning falls into three modes. OpenAI's o4-mini shows 73.8-88.3% step necessity on five of six tasks, the highest SLRC in our study. The critical differentiator is RL-based reasoning training, not thinking tokens: Grok-4's reasoning mode shows lower faithfulness than its non-reasoning mode (1.4% vs 7.2% necessity).…
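The abstract reports "step necessity" percentages without giving the formula in the text available here. As a rough illustration only, the sketch below assumes step necessity can be approximated by ablation: remove one reasoning step at a time and count how often the final answer changes. This is not the paper's SLRC definition; `step_necessity`, `rigid_model`, and `faithful_model` are hypothetical names introduced for the example, and the toy "models" are plain functions standing in for a language model.

```python
from typing import Callable, Sequence


def step_necessity(
    steps: Sequence[str],
    answer_fn: Callable[[Sequence[str]], str],
) -> float:
    """Fraction of reasoning steps whose removal changes the final answer.

    Assumption: single-step ablation is used as a stand-in for the
    paper's causal estimator, which is not specified in this summary.
    """
    if not steps:
        return 0.0
    baseline = answer_fn(steps)
    changed = 0
    for i in range(len(steps)):
        # Drop step i and ask for the answer again.
        ablated = list(steps[:i]) + list(steps[i + 1:])
        if answer_fn(ablated) != baseline:
            changed += 1
    return changed / len(steps)


def rigid_model(steps: Sequence[str]) -> str:
    """Toy 'decorative' reasoner: the answer ignores the steps entirely."""
    return "42"


def faithful_model(steps: Sequence[str]) -> str:
    """Toy 'faithful' reasoner: the answer is computed from the steps."""
    return str(sum(int(s) for s in steps))


if __name__ == "__main__":
    steps = ["1", "2", "3"]
    print(step_necessity(steps, rigid_model))     # answer never changes
    print(step_necessity(steps, faithful_model))  # every step matters
```

Under this proxy, a model whose answer is fixed before reasoning begins scores near zero, while a model that actually uses its steps scores near one. A step that contributes nothing (for example a "0" in the toy sum) lowers the score, which is why necessity below 100% does not by itself imply unfaithful reasoning.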
