TL;DR
This paper introduces PACR, a dense reward mechanism for LLM reasoning that guides the model's belief trajectory, improving exploration efficiency and performance in reinforcement learning settings.
Contribution
PACR is a novel, model-intrinsic reward that encodes ascending confidence in the correct answer, enhancing reinforcement learning for LLM reasoning tasks.
Findings
PACR accelerates exploration and reaches reward saturation faster.
PACR improves performance on multiple reasoning benchmarks.
Empirical and theoretical analysis supports the claim that the inductive bias constrains exploration to regions richer in logically sound reasoning.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved LLM reasoning, but its sparse, outcome-based reward provides no guidance for intermediate steps, slowing exploration. We propose Progressively Ascending Confidence Reward (PACR), a dense, model-intrinsic reward computed directly from the model's evolving belief in the correct answer. PACR encodes the inductive bias that, along a well-formed reasoning trajectory, the probability of the ground-truth answer should have a generally ascending trend. We provide empirical and theoretical analysis validating that such an inductive bias constrains the exploration search space to regions richer in logically sound reasoning. We demonstrate that PACR accelerates exploration, reaches reward saturation with fewer trajectories, and yields improvements on multiple benchmarks. Our results suggest that dense, model-intrinsic…
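To make the idea of an "ascending confidence" signal concrete, here is a minimal sketch rather than the authors' implementation. It assumes we already have, for each reasoning step, the log-probability the model assigns to the ground-truth answer given the question and the steps so far, and it assumes the per-step reward is the change in that log-probability. The names `confidence_gains` and `ascending_fraction` are hypothetical, and the toy probabilities are made up for illustration.

```python
import math
from typing import List


def confidence_gains(gt_logprobs: List[float]) -> List[float]:
    """Per-step change in the log-probability of the ground-truth answer.

    gt_logprobs[0] is the value before any reasoning step; gt_logprobs[k]
    is the value after step k. (Assumed input format, not from the paper.)
    """
    if len(gt_logprobs) < 2:
        return []
    # Positive gain: the step raised the model's belief in the correct answer.
    return [curr - prev for prev, curr in zip(gt_logprobs, gt_logprobs[1:])]


def ascending_fraction(gains: List[float]) -> float:
    """Share of steps with a strictly positive gain (0.0 for empty input)."""
    if not gains:
        return 0.0
    return sum(1 for g in gains if g > 0) / len(gains)


# Toy trajectory: belief in the correct answer dips once, then rises.
toy_probs = [0.10, 0.20, 0.15, 0.60]
toy_logps = [math.log(p) for p in toy_probs]
toy_gains = confidence_gains(toy_logps)
toy_fraction = ascending_fraction(toy_gains)
```

In this toy trajectory the second step lowers the belief, so its gain is negative, while the gains still sum to the overall change from the first to the last log-probability. How such gains are aggregated into the training objective is not specified in the abstract, so that part is left out of the sketch.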
Peer Reviews
Decision·Submitted to ICLR 2026
1. The problem addressed in this paper is important. Incorporating more dense reward information into current RL pipelines remains an underexplored direction.
2. The proposed idea is reasonable and aligns with recent studies showing that a model’s reasoning confidence often correlates with the correctness of its answers, suggesting that such confidence signals could be valuable for training.
3. The writing is clear and easy to follow.
1. The approach for determining the reasoning step appears rather ad-hoc. It is unclear how this mechanism would transfer to other domains such as code generation, where the output often contains many new lines. Would this lead to an excessive number of reasoning steps for tasks involving long-context generation?
2. The training process seems to introduce additional computational overhead, particularly as the generation length increases, which could significantly inflate the number of reasoning
The paper addresses a key problem in RL training: that dense rewards are hard and expensive to acquire. The paper's presentation is clear. The supporting evidence (observations 1, 2, and 3) for the method is mostly relevant and well thought-out. The gain from the method over the Dr.GRPO baseline is good, though on some test datasets it is negative.
The experiment is slightly limited, with only one training dataset and three models, and one baseline algorithm (Dr.GRPO). I would like to see one reasoning model (e.g., DeepSeek-R1-Distill-Qwen-1.5B) tested to see if the effectiveness of your proposed process reward still holds. Also, report the accuracy on AIME 2025. I dislike the inclusion of Section 4.2, as it makes the proposed method look deep, whereas the key to the proof is really the artificial "oracle policy assumption". I think that
1. The paper gives a very clear inductive bias: along a good reasoning path, confidence in the ground truth should tend to go up. Turning this into a dense reward that needs no extra reward model is practical and clean.
2. The paper checks three open models and five math benchmarks. The main table shows Dense PACR improves average pass@1 over a strong Dr.GRPO baseline, for example +2.5 on the 1.5B model and +3.0 on the 7B model.
1. Experiments only cover math datasets. Many recent results also evaluate general reasoning and code. It is not clear if PACR transfers beyond numeric answers or beyond tasks where the final answer is exactly verifiable, for example, long-form QA or proofs with multiple valid forms. Comparison to broader settings in R1-style training or DAPO-like systems would strengthen the claim.
2. The proof shows that the expected confidence gain is non-negative when steps are drawn from the ground truth co
