Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
Xinzhu Chen, Wei He, Huichuan Fan, Wenzhe Niu, Zhongxiang Sun, Xuanru Wang, Jiuchong Gao, Jinghua Hao, Renqing He, Weijie Yu

TL;DR
This paper introduces SHEAR, a method that uses span-level Wasserstein distances between hidden states to improve credit assignment in reinforcement learning, enhancing performance on reasoning and code generation tasks.
Contribution
It demonstrates that hidden-state distribution divergence can serve as a self-supervision signal for fine-grained credit assignment without extra annotations or reward models.
Findings
SHEAR outperforms standard GRPO on reasoning and code benchmarks.
Hidden-state Wasserstein distances correlate with local reasoning divergence.
The method requires minimal changes and no additional model training.
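The second finding can be illustrated with a small sketch of a span-level distance between hidden states. This summary does not say how SHEAR computes the Wasserstein distance or how spans are defined, so everything here is an assumption made for illustration: the function names (`wasserstein_1d`, `sliced_wasserstein`, `span_distances`), the fixed span length, the position-wise pairing of spans, and the use of a sliced (random-projection) approximation in place of an exact multivariate distance.

```python
import numpy as np

def wasserstein_1d(u, v, n_quantiles=64):
    """Approximate the 1-D Wasserstein-1 distance by comparing empirical quantiles."""
    q = (np.arange(n_quantiles) + 0.5) / n_quantiles
    return float(np.mean(np.abs(np.quantile(u, q) - np.quantile(v, q))))

def sliced_wasserstein(x, y, n_proj=32, seed=0):
    """Average 1-D distance over random unit directions (an assumed approximation).

    x, y: arrays of shape (n_tokens, hidden_dim) holding one span's hidden states.
    """
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    px, py = x @ dirs.T, y @ dirs.T  # projected samples, shape (n_tokens, n_proj)
    return float(np.mean([wasserstein_1d(px[:, k], py[:, k]) for k in range(n_proj)]))

def span_distances(h_correct, h_incorrect, span_len=8):
    """Distance per fixed-length span, pairing spans by position (hypothetical choice)."""
    n_spans = min(len(h_correct), len(h_incorrect)) // span_len
    return np.array([
        sliced_wasserstein(h_correct[i * span_len:(i + 1) * span_len],
                           h_incorrect[i * span_len:(i + 1) * span_len])
        for i in range(n_spans)
    ])
```

On synthetic data in which the "incorrect" hidden states are shifted only in the second half of the sequence, the later spans receive larger distances than the earlier ones. That is the qualitative pattern the finding describes, though real hidden states need not behave this cleanly.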
Abstract
Group Relative Policy Optimization (GRPO) performs coarse-grained credit assignment in reinforcement learning with verifiable rewards (RLVR) by assigning the same advantage to all tokens in a rollout. Process reward models can provide finer-grained supervision, but they require step-level annotation or additional reward modeling. We show that hidden-state distributions contain a useful signal for local reasoning quality that can be extracted using only outcome-level correctness labels available in RLVR. Specifically, within each GRPO group, the Wasserstein distance between span-level hidden state distributions of correct and incorrect rollouts increases around regions where their local reasoning quality diverges. This association holds both across examples and within individual trajectories, suggesting that hidden-state distributional divergence can serve as a self-supervision signal…
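The coarse-grained credit assignment described in the abstract can be sketched as follows. This uses the commonly stated GRPO formulation, in which each rollout's reward is standardized within its group and the resulting scalar is copied to every token. The function name `grpo_token_advantages`, the `eps` constant, and the binary rewards are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def grpo_token_advantages(rewards, lengths, eps=1e-6):
    """Group-relative advantages, broadcast uniformly over each rollout's tokens.

    rewards: one outcome-level reward per rollout in the group.
    lengths: number of generated tokens in each rollout.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    # Standardize within the group; eps guards against a zero standard deviation.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Every token in rollout i receives the same scalar advantage adv[i].
    return [np.full(n, a) for a, n in zip(adv, lengths)]

# Example: a group of four rollouts, two correct (reward 1) and two incorrect (reward 0).
token_adv = grpo_token_advantages([1.0, 0.0, 0.0, 1.0], lengths=[3, 2, 4, 1])
```

Because the advantage is constant within a rollout, a token in a sound reasoning step and a token in a flawed one are treated identically. A span-level signal such as the hidden-state distance is meant to differentiate them.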
