CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment
Guofu Xie, Yunsheng Shi, Hongtao Tian, Ting Yao, Xiao Zhang

TL;DR
CAPO introduces a novel method leveraging large language models as generative reward models to provide step-wise, deterministic credit assignment, significantly improving reasoning accuracy in LLMs across multiple benchmarks.
Contribution
The paper proposes CAPO, a simple, efficient approach that directly uses LLMs for step-wise critique, overcoming limitations of existing reward models and enhancing reasoning in LLMs.
Findings
CAPO outperforms supervised and RL fine-tuning on mathematical benchmarks.
CAPO improves reasoning pathways leading to correct answers.
CAPO demonstrates robustness across various models and benchmarks.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback. However, current RLVR methods typically assign the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies. Methods like PPO provide credit assignment by value estimation, but yield inaccurate and unverifiable signals due to limited sampling. On the other hand, methods using Process Reward Models can provide step-wise rewards but suffer from several key limitations: they require high-quality process supervision labels, the feedback is unreliable due to probabilistic reward modeling, and their application in online reinforcement learning (RL) is time-consuming. To…
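To make the coarse-grained feedback described above concrete, here is a minimal sketch, assuming a GRPO-style setup in which each sampled response receives one binary verifiable reward, the reward is normalized within its sampling group, and the resulting scalar is copied onto every token of that response. The function name `uniform_token_advantages`, the `eps` constant, and the normalization details are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def uniform_token_advantages(rewards, token_counts, eps=1e-6):
    """Broadcast one group-normalized scalar per response to all of its tokens.

    rewards: binary outcome rewards (1.0 correct, 0.0 wrong), one per response.
    token_counts: number of tokens in each response.
    NOTE: hypothetical helper for illustration only.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    advantages = []
    for reward, n_tokens in zip(rewards, token_counts):
        scalar = (reward - mu) / (sigma + eps)
        # Every token in the response gets the same value, so the signal
        # cannot distinguish a correct step from a flawed one.
        advantages.append([scalar] * n_tokens)
    return advantages


group = uniform_token_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 1])
```

In this sketch all tokens of a wrong response are penalized equally, including tokens from steps that were in fact correct, which is the credit assignment problem the abstract points to.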
Peer Reviews
Decision: Submitted to ICLR 2026
The paper presents a creative synthesis of existing ideas. Using a general-purpose LLM as a verifier for process supervision is a simple yet powerful insight that effectively bypasses the significant overhead of training dedicated PRMs while leveraging the inherent reasoning capabilities of modern LLMs. The formulation of the credit assignment mechanism and the analysis of the outcome-process reward trade-off are novel contributions to the RLVR framework.
While efficient, the method introduces non-trivial computational overhead compared to simpler baselines that rely only on rule-based verifiers, such as GRPO. Each policy rollout requires multiple inference passes with a very large LLM (GenPRM). Although the paper correctly notes this aligns with trends of using large models as guides, a more detailed discussion of the actual cost (such as GPU hours or a direct wall-clock time comparison with GRPO) would strengthen the practicality claim. The paper is transparent about this co
1. Methodological Innovation: The method proposes an innovative approach using an LLM to implement a stepwise PRM. This design provides a more granular form of credit assignment by attributing different reward values to tokens depending on their corresponding reasoning step, which is a significant step beyond whole-response-based RL methods.
2. The paper includes a valuable ablation study on the different weights of the $P$ term (as presented in Table 4). This analysis effectively investigates
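The idea of attributing different reward values to tokens depending on their reasoning step can be sketched as follows. This is a minimal illustration under stated assumptions: the combination rule (subtract a fixed penalty from the tokens of every step judged incorrect) and the names `stepwise_token_rewards` and `process_weight` are hypothetical, and `process_weight` is only loosely analogous to the weight on the $P$ term mentioned in the review. The paper's actual formula is not reproduced here.

```python
def stepwise_token_rewards(step_token_counts, step_is_correct,
                           outcome_reward, process_weight=0.5):
    """Assign a per-token reward that depends on the step the token belongs to.

    step_token_counts: number of tokens in each reasoning step.
    step_is_correct: one boolean verdict per step (e.g. from an LLM critique).
    outcome_reward: rule-based reward for the final answer (1.0 or 0.0).
    process_weight: assumed penalty applied to tokens of incorrect steps.
    NOTE: hypothetical combination rule, for illustration only.
    """
    if len(step_token_counts) != len(step_is_correct):
        raise ValueError("need exactly one verdict per step")
    rewards = []
    for n_tokens, is_correct in zip(step_token_counts, step_is_correct):
        penalty = 0.0 if is_correct else process_weight
        rewards.extend([outcome_reward - penalty] * n_tokens)
    return rewards


# Three steps of 2, 3 and 1 tokens; the critique flags only the second step.
token_rewards = stepwise_token_rewards([2, 3, 1], [True, False, True], 1.0, 0.5)
```

Here only the three tokens of the flagged step receive the reduced reward, while the tokens of the other steps keep the full outcome reward.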
1. The definition and segmentation of the output reasoning steps is a fundamental and critical aspect of this method. However, the paper only briefly addresses this matter in Appendix D. The forced output mechanism using markers like <step k> is not sufficiently justified. The reader is left to wonder whether a more principled or adaptive step segmentation strategy could yield a significantly better-performing PRM.
2. The method only achieves a not entirely convincing improv
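For readers unfamiliar with marker-based segmentation, the following minimal sketch splits a response on markers of the form `<step k>`. The exact marker syntax (a literal `<step `, an integer, then `>`), the regular expression, and the function name `split_into_steps` are assumptions made for illustration; the review only says the paper uses "markers like <step k>".

```python
import re

# Assumed marker format: "<step 1>", "<step 2>", ...
STEP_MARKER = re.compile(r"<step\s+(\d+)>")


def split_into_steps(response):
    """Return (step_index, text) pairs for each <step k> segment.

    Text before the first marker is ignored. Hypothetical helper.
    """
    matches = list(STEP_MARKER.finditer(response))
    steps = []
    for i, match in enumerate(matches):
        # A step runs from the end of its marker to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(response)
        steps.append((int(match.group(1)), response[match.end():end].strip()))
    return steps


example = "<step 1> Let x = 2. <step 2> Then x + 3 = 5. <step 3> The answer is 5."
steps = split_into_steps(example)
```

A fixed marker scheme like this is simple to parse, but it leaves the step granularity entirely up to how the policy model chooses to emit markers, which is the concern the reviewer raises.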
1. CAPO is "elegantly simple" and avoids the need for complex, time-consuming auxiliary models (like PRMs or value models in PPO) or costly high-quality process supervision data. It efficiently generates all step-wise credit in a single pass.
2. By prompting a powerful LLM to focus on the intrinsic correctness of each step, CAPO provides deterministic process credits, which are more reliable and less susceptible to reward hacking than probabilistic, estimation-based signals.
3. The fram
1. The efficacy and reliability of the process credit mechanism are predicated on the use of an LLM-as-GenPRM that is significantly more capable than the policy model being trained (e.g., using a 70B+ model to guide a 1B-7B model). This relies on the availability of a superior external model.
2. The introduction of process-level rewards creates a potential conflict with the original outcome-level reward signal. This required an in-depth analysis and the proposal of a non-trivial asymmetric rewa
Taxonomy
Topics: Digital Rights Management and Security · Business Process Modeling and Analysis
