TL;DR
This paper introduces GTPO and GRPO-S, two algorithms that use entropy-based reward shaping to make reinforcement learning for large language models more fine-grained and stable, with the aim of improving reasoning.
Contribution
The paper proposes novel entropy-weighted reward redistribution methods, enabling token and sequence-level reward shaping in RL for LLMs, enhancing training stability and reasoning performance.
Findings
GTPO assigns token-specific rewards based on entropy weights.
GRPO-S extends reward shaping to sequence level, improving stability.
Both algorithms are reported to outperform GRPO and DAPO on long Chain-of-Thought reasoning tasks.
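The ingredient shared by both findings is the per-token policy entropy. The sketch below shows one way to compute it from logits and turn it into token-level or sequence-level weights. The function names and the simple "share of total entropy" normalization are assumptions made for illustration; the paper's exact weight ratios are not given in this summary.

```python
import numpy as np

def token_entropies(logits):
    """Shannon entropy (in nats) of the next-token distribution at each position.

    logits: array of shape (seq_len, vocab_size).
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    return -(probs * log_probs).sum(axis=-1)

def token_weights(entropies):
    """Hypothetical token-level weights: each token's share of the response's total entropy."""
    entropies = np.asarray(entropies, dtype=np.float64)
    total = entropies.sum()
    if total <= 0.0:
        # Fully deterministic response: fall back to uniform weights.
        return np.full(entropies.shape, 1.0 / entropies.size)
    return entropies / total

def sequence_weights(per_response_entropies):
    """Hypothetical sequence-level weights: each response's share of the group's mean entropy."""
    means = np.array([np.mean(h) for h in per_response_entropies], dtype=np.float64)
    total = means.sum()
    if total <= 0.0:
        return np.full(means.shape, 1.0 / means.size)
    return means / total
```

Token-level weights differentiate positions within one response, while sequence-level weights assign a single value per response in the sampled group, which is the coarser granularity the GRPO-S finding refers to.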
Abstract
Reinforcement Learning (RL) is pivotal for enhancing Large Language Model (LLM) reasoning, yet mainstream algorithms such as GRPO and DAPO remain constrained by a coarse-grained credit assignment paradigm in which all tokens within the same response receive an identical reward. In this paper, we propose Dynamic Entropy Weighting: we systematically define entropy-based weight ratios and similar variants to redistribute rewards at a finer granularity, and realize this through two new algorithms. Group Token Policy Optimization (GTPO) assigns an entropy-weighted reward to each token and synthesizes a token-specific advantage function to drive the model toward the optimal path. The analogous Sequence-Level GRPO (GRPO-S) extends this design to the sequence level and exhibits superior stability in long Chain-of-Thought (CoT) reasoning tasks.
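To make the contrast in the abstract concrete, the sketch below places a coarse-grained, GRPO-style advantage (one value per response, copied to every token) next to an entropy-weighted redistribution of the same reward across tokens. The shaping rule, the mixing coefficient `alpha`, and the function names are assumptions chosen for illustration, not the paper's equations.

```python
import numpy as np

def grpo_advantages(rewards, lengths, eps=1e-8):
    """Coarse-grained baseline: group-normalize one reward per response,
    then give every token in that response the same advantage."""
    rewards = np.asarray(rewards, dtype=np.float64)
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    return [np.full(n, a) for a, n in zip(adv, lengths)]

def entropy_shaped_token_rewards(reward, entropies, alpha=0.5):
    """Assumed shaping rule (illustrative only):

        r_t = reward * ((1 - alpha) + alpha * T * H_t / sum_k H_k)

    The mean token reward equals the original response reward, so the
    reward is redistributed toward high-entropy tokens, not changed in total.
    """
    entropies = np.asarray(entropies, dtype=np.float64)
    n_tokens = entropies.size
    total = entropies.sum()
    if total <= 0.0:
        # No entropy signal: every token keeps the plain response reward.
        return np.full(n_tokens, float(reward))
    weights = entropies / total
    return reward * ((1.0 - alpha) + alpha * n_tokens * weights)

# Two sampled responses: one correct (reward 1), one incorrect (reward 0).
coarse = grpo_advantages(rewards=[1.0, 0.0], lengths=[3, 2])
fine = entropy_shaped_token_rewards(1.0, entropies=[0.1, 0.3, 0.6], alpha=0.5)
```

Under this assumed rule, the three tokens of the correct response receive different rewards, while the baseline gives all three the same advantage. Setting `alpha=0` recovers the uniform per-token reward.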
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper clearly identifies coarse-grained credit assignment as a fundamental limitation in current RL fine-tuning of LLMs.
2. Instead of treating policy entropy as mere “uncertainty,” the paper repurposes it as a proxy for cognitive effort, rewarding high-entropy decisions in correct answers and penalizing overconfident errors.
1. The paper's discussion of related literature is limited in scope, potentially missing important context. Many earlier works have attempted token-level rewards or alternative credit assignment heuristics, and the paper should contrast itself with them. Moreover, given the focus on entropy, one might expect references to prior uses of entropy or uncertainty in exploration or credit assignment.
2. Some aspects of the technical presentation suffer from notation inconsistencies or ambiguity, which could confuse
- The proposed solution is simple yet effective.
- The paper is easy to follow.
- The major concern is the experimental evaluation. The experiments are conducted on only two Qwen2.5 series models and only on AIME benchmarks. The authors should run more experiments on other base models (e.g., Llama and the DeepSeek-R1-Distill series) and other benchmarks (e.g., AMC23, Minerva Math, OlympiadBench, LiveCodeBench) to validate the effectiveness and generalization of the proposed methods.
- The proposed methods introduce four hyperparameters (i.e., $\alpha_1$, $\alph
1. The method is well motivated and easy to understand or implement.
2. The experiments show a clear advantage over GRPO and DAPO.
3. The authors discussed future works and limitations of the paper in detail.
1. The mathematical derivation of why we have Equations 3 and 5 is unclear. Why does this give us a better policy gradient in theory?
2. The authors tried to give some theoretical guarantees in the paper. However, all the results are in the appendix. I'm expecting at least the theorems themselves to appear in the main text. Besides, it is unclear to me why the proposed PG is unbiased. The 'approximately equal to' statement is not a formal statement that can act as a theoretical guarantee.
3. It
