Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

Fei Ding; Yongkang Zhang; Yeling Peng; Youwei Wang; Guoxiong Zhou; Zijian Zeng

arXiv:2605.16302·cs.LG·May 19, 2026

TL;DR

This paper introduces a counterfactual reasoning framework to improve credit assignment in reinforcement learning with large language models, reducing variance and stabilizing training.

Contribution

The paper proposes an implicit process-level advantage estimator and, building on it, Implicit Behavior Policy Optimization (IBPO), a method intended to improve training stability and performance on reasoning tasks.

Findings

01 Significantly improves training stability.

02 Achieves better performance on reasoning benchmarks.

03 Transforms sparse rewards into step-sensitive signals.

Abstract

Reinforcement learning for multi-step reasoning with large language models (LLMs) often relies on sparse terminal rewards, leading to poor credit assignment conditions where the final feedback is evenly propagated across all intermediate decisions. This results in high gradient variance, unstable training, and numerous ineffective updates, ultimately causing the model to fail and preventing sustained improvement. We introduce a counterfactual comparison-based credit assignment framework, which samples multiple reasoning trajectories under the same input. By treating their differences as an implicit approximation of alternative decisions, we construct an implicit process-level advantage estimator that transforms sparse terminal rewards into step-sensitive learning signals. Based on this, we propose Implicit Behavior Policy Optimization (IBPO), which significantly improves training…
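
The abstract is truncated before the estimator is defined, so the following is only a minimal sketch of the general idea, not the paper's IBPO estimator. It assumes a hypothetical prefix-based scheme: several trajectories are sampled for one input, the value of a partial reasoning path is taken to be the mean terminal reward of the samples that share that prefix, and a step's advantage is the change in that value when the step is taken. The function names, the toy trajectories, and the rewards are all illustrative assumptions. For contrast, `uniform_advantages` shows the sparse-reward baseline in which every step inherits the same group-centered terminal reward.

```python
from collections import defaultdict


def uniform_advantages(trajectories, rewards):
    """Baseline: every step gets the same group-centered terminal reward."""
    mean_r = sum(rewards) / len(rewards)
    return [[r - mean_r] * len(traj) for traj, r in zip(trajectories, rewards)]


def prefix_advantages(trajectories, rewards):
    """Hypothetical step-sensitive estimator (an assumption, not the paper's).

    The value of a prefix is the mean terminal reward over all sampled
    trajectories that begin with that prefix. The advantage of step t is
    value(prefix including step t) minus value(prefix before step t).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for traj, r in zip(trajectories, rewards):
        for t in range(len(traj) + 1):
            prefix = tuple(traj[:t])
            totals[prefix] += r
            counts[prefix] += 1
    value = {p: totals[p] / counts[p] for p in totals}
    return [
        [value[tuple(traj[: t + 1])] - value[tuple(traj[:t])]
         for t in range(len(traj))]
        for traj in trajectories
    ]


# Toy data: four sampled reasoning paths for one input, steps as labels.
# Only the first path reaches a correct answer (terminal reward 1.0).
trajectories = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a", "e", "f"],
    ["g", "h", "i"],
]
rewards = [1.0, 0.0, 0.0, 0.0]

uniform = uniform_advantages(trajectories, rewards)
stepwise = prefix_advantages(trajectories, rewards)

for traj, u, s in zip(trajectories, uniform, stepwise):
    print(traj, [round(x, 3) for x in u], [round(x, 3) for x in s])
```

In this toy setup the first two paths share the prefix `a, b` and differ only in the last step, so the prefix-based estimate concentrates most of the credit (and blame) on that final step, whereas the uniform baseline spreads the same value over all three steps. Because the per-step differences telescope, each trajectory's step advantages sum to its group-centered terminal reward whenever its full path is unique among the samples.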
