Reducing Credit Assignment Variance via Counterfactual Reasoning Paths
Fei Ding, Yongkang Zhang, Yeling Peng, Youwei Wang, Guoxiong Zhou, Zijian Zeng

TL;DR
This paper introduces a counterfactual reasoning framework to improve credit assignment in reinforcement learning with large language models, reducing variance and stabilizing training.
Contribution
It proposes an implicit process-level advantage estimator and, built on it, Implicit Behavior Policy Optimization (IBPO), aimed at improving training stability and performance on reasoning tasks.
Findings
Significantly improves training stability.
Achieves better performance on reasoning benchmarks.
Transforms sparse terminal rewards into step-sensitive learning signals.
Abstract
Reinforcement learning for multi-step reasoning with large language models (LLMs) often relies on sparse terminal rewards, which makes credit assignment poor: the final feedback is propagated evenly across all intermediate decisions. This results in high gradient variance, unstable training, and numerous ineffective updates, ultimately causing the model to fail and preventing sustained improvement. We introduce a counterfactual comparison-based credit assignment framework, which samples multiple reasoning trajectories under the same input. By treating their differences as an implicit approximation of alternative decisions, we construct an implicit process-level advantage estimator that transforms sparse terminal rewards into step-sensitive learning signals. Based on this, we propose Implicit Behavior Policy Optimization (IBPO), which significantly improves training…
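To make the contrast in the abstract concrete, the toy sketch below compares evenly propagated terminal credit with a step-sensitive signal derived by comparing several sampled trajectories for the same input. It is not the paper's estimator, which the truncated abstract does not specify. The function names, the representation of reasoning steps as discrete labels, and the prefix-matching rule are all assumptions made for illustration; real LLM samples rarely share exact prefixes.

```python
from collections import defaultdict
from statistics import mean


def uniform_advantages(trajectories, rewards):
    """Every step of a trajectory receives the same credit:
    its terminal reward minus the group mean reward."""
    baseline = mean(rewards)
    return [[r - baseline] * len(t) for t, r in zip(trajectories, rewards)]


def prefix_values(trajectories, rewards):
    """Mean terminal reward over all sampled trajectories that share a prefix
    (hypothetical stand-in for a value estimate; an assumption, not from the paper)."""
    buckets = defaultdict(list)
    for traj, r in zip(trajectories, rewards):
        for k in range(len(traj) + 1):
            buckets[tuple(traj[:k])].append(r)
    return {prefix: mean(rs) for prefix, rs in buckets.items()}


def stepwise_advantages(trajectories, rewards):
    """Credit for step k = value of the prefix including step k
    minus value of the prefix before it."""
    v = prefix_values(trajectories, rewards)
    return [
        [v[tuple(t[: k + 1])] - v[tuple(t[:k])] for k in range(len(t))]
        for t in trajectories
    ]


# Four sampled trajectories for one input; only the first earns the reward.
trajs = [["a", "b"], ["a", "c"], ["d", "b"], ["d", "c"]]
rewards = [1.0, 0.0, 0.0, 0.0]

print(uniform_advantages(trajs, rewards)[1])   # [-0.25, -0.25]
print(stepwise_advantages(trajs, rewards)[1])  # [0.25, -0.5]
```

In the second trajectory, even propagation penalizes both steps equally, while the comparison-based signal credits the shared first step "a" and places the blame on the diverging step "c". Under this construction the per-step credits of each trajectory sum to its uniform advantage, so the terminal signal is redistributed rather than changed.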
