Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning

Ziheng Li; Liu Kang; Feng Xiao; Luxi Xing; Qingyi Si; Zhuoran Li; Weikang Gong; Deqing Yang; Yanghua Xiao; Hongcheng Guo

arXiv:2601.07408·cs.CL·January 13, 2026

TL;DR

This paper introduces Outcome-grounded Advantage Reshaping (OAR), a fine-grained credit assignment method for reinforcement learning in mathematical reasoning, improving performance by better attributing token contributions.

Contribution

The paper proposes OAR, a novel advantage reshaping technique that enhances credit assignment in critic-free RL for reasoning tasks, instantiated through two strategies: OAR-P and OAR-G.

Findings

01

OAR-P achieves the highest performance upper bound.

02

OAR-G provides comparable gains with minimal computational cost.

03

Both methods significantly outperform the existing GRPO baseline.

Abstract

Group Relative Policy Optimization (GRPO) has emerged as a promising critic-free reinforcement learning paradigm for reasoning tasks. However, standard GRPO employs a coarse-grained credit assignment mechanism that propagates group-level rewards uniformly to every token in a sequence, neglecting the varying contribution of individual reasoning steps. We address this limitation by introducing Outcome-grounded Advantage Reshaping (OAR), a fine-grained credit assignment mechanism that redistributes advantages based on how much each token influences the model's final answer. We instantiate OAR via two complementary strategies: (1) OAR-P, which estimates outcome sensitivity through counterfactual token perturbations, serving as a high-fidelity attribution signal; (2) OAR-G, which uses an input-gradient sensitivity proxy to approximate the influence signal with a single backward pass.…
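
To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows the commonly described GRPO step of standardizing rewards within a group and copying one advantage to every token, next to one possible way of redistributing that advantage across tokens. The function names, the mean-preserving normalization, and the example sensitivity scores are all assumptions made for illustration; the abstract does not specify the exact reshaping rule, and the scores stand in for whatever OAR-P (perturbation) or OAR-G (input gradient) would produce.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Standardize outcome rewards within one group of sampled responses.

    Uses the population standard deviation; this choice is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_uniform(advantage, num_tokens):
    """Coarse-grained credit: every token receives the same advantage."""
    return [advantage] * num_tokens


def reshape_by_sensitivity(advantage, sensitivities, eps=1e-8):
    """Hypothetical reshaping rule (assumption, not taken from the paper).

    Scales the sequence-level advantage by each token's share of the total
    sensitivity, with weights averaging to 1 so the summed advantage over
    the sequence matches the uniform case.
    """
    n = len(sensitivities)
    total = sum(sensitivities)
    if total <= eps:
        # No usable signal: fall back to uniform credit.
        return [advantage] * n
    return [advantage * n * s / total for s in sensitivities]


# Four sampled answers to one problem, with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)

# Made-up per-token sensitivity scores for the first response (4 tokens).
sensitivities = [0.1, 0.1, 0.8, 0.0]
uniform = broadcast_uniform(advantages[0], len(sensitivities))
reshaped = reshape_by_sensitivity(advantages[0], sensitivities)
```

In this sketch the third token, which carries most of the assumed sensitivity, receives most of the advantage, while the total over the sequence is unchanged from the uniform case. Whether OAR preserves the total in this way is not stated in the abstract.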

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Stochastic Gradient Optimization Techniques