Discovering Process-Outcome Credit in Multi-Step LLM Reasoning
Xiangwei Wang, Wei Wang, Ken Chen, Nanduni Nimalsiri, and Saman Halgamuge

TL;DR
This paper introduces a novel reinforcement learning framework for large language models that provides continuous, process-oriented rewards to improve reasoning accuracy, efficiency, and robustness across various benchmarks.
Contribution
It proposes a Step-wise Marginal Information Gain mechanism, a Decoupled Masking Strategy, and a Dual-Gated SFT objective to enhance reasoning in LLMs with continuous rewards and disentangled credit assignment.
Findings
Outperforms baselines like GRPO in sample efficiency and accuracy
Demonstrates superior out-of-distribution robustness
Shows promising zero-shot transfer to unseen tasks
Abstract
Reinforcement Learning (RL) serves as a potent paradigm for enhancing reasoning capabilities in Large Language Models (LLMs), yet standard outcome-based approaches often suffer from reward sparsity and inefficient credit assignment. In this paper, we propose a novel framework designed to provide continuous reward signals. It introduces a Step-wise Marginal Information Gain (MIG) mechanism that quantifies the intrinsic value of each reasoning step against a Monotonic Historical Watermark, effectively filtering out training noise. To ensure disentangled credit distribution, we implement a Decoupled Masking Strategy that applies process-oriented rewards specifically to the chain-of-thought (CoT) and outcome-oriented rewards to the full completion. Additionally, we incorporate a Dual-Gated SFT objective to stabilize training with high-quality structural and factual signals. Extensive…
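The abstract does not give the formulas, so the following is only an illustrative sketch of how a marginal-gain reward measured against a monotonic watermark, and a decoupled reward mask, might look. Everything in it is an assumption made for illustration rather than the authors' method: the function names `marginal_gains` and `token_advantages`, the idea that each reasoning step receives a scalar score, the interpretation of the watermark as a running maximum of past scores, and the additive combination of process and outcome signals.

```python
from typing import List


def marginal_gains(step_scores: List[float], initial_watermark: float = 0.0) -> List[float]:
    """Credit each step only for the amount it lifts the running maximum.

    Assumption: the "Monotonic Historical Watermark" is modeled here as the
    highest score seen so far, so it never decreases. A step that does not
    exceed it earns zero, which is one way noisy fluctuations could be
    filtered out.
    """
    watermark = initial_watermark
    gains = []
    for score in step_scores:
        gains.append(max(0.0, score - watermark))
        watermark = max(watermark, score)
    return gains


def token_advantages(process_signal: float, outcome_signal: float, cot_mask: List[int]) -> List[float]:
    """Apply the process signal to CoT tokens only and the outcome signal to all tokens.

    Assumption: cot_mask holds 1 for chain-of-thought tokens and 0 for the
    remaining completion tokens, and the two signals are simply summed.
    """
    return [outcome_signal + process_signal * m for m in cot_mask]


if __name__ == "__main__":
    # Hypothetical per-step scores for a five-step reasoning trace.
    print(marginal_gains([0.25, 0.125, 0.5, 0.5, 1.0]))
    # Three CoT tokens followed by two answer tokens.
    print(token_advantages(0.5, 1.0, [1, 1, 1, 0, 0]))
```

In this sketch a step that merely repeats or falls below an earlier score earns nothing, so reward accrues only when the trace makes new progress, while the mask keeps the process signal away from the answer tokens.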
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
