Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs
Zhihe Yang, Xufang Luo, Zilong Wang, Dongqi Han, Zhiyuan He, Dongsheng Li, Yunjian Xu

TL;DR
This paper addresses the issue of low-probability tokens dominating gradients in RL training of LLMs, proposing methods to balance token influence and improve reasoning performance.
Contribution
It introduces Advantage Reweighting and Lopti, two techniques to mitigate low-probability token dominance in RL training of LLMs, enhancing learning efficiency.
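As a rough illustration of what reweighting advantages by token probability could look like, here is a minimal sketch. This summary does not state the formula, so the linear weight `alpha * prob + (1 - alpha)`, the function name `reweight_advantages`, and the default `alpha=0.3` are all assumptions made for illustration, not the authors' confirmed definition.

```python
import numpy as np

def reweight_advantages(advantages, token_probs, alpha=0.3):
    """Scale each token's advantage by a weight that grows with its probability.

    Assumed (hypothetical) form: weight = alpha * prob + (1 - alpha).
    With alpha in [0, 1], low-probability tokens get a weight near (1 - alpha)
    and high-probability tokens get a weight near 1.
    """
    advantages = np.asarray(advantages, dtype=float)
    token_probs = np.asarray(token_probs, dtype=float)
    weights = alpha * token_probs + (1.0 - alpha)
    return weights * advantages

# Three tokens sharing the same advantage but with different probabilities.
adv = np.array([1.0, 1.0, 1.0])
probs = np.array([0.05, 0.50, 0.95])
reweighted = reweight_advantages(adv, probs)
print(reweighted)
```

Under this assumed form, the low-probability token's contribution is scaled down relative to the high-probability token, which is the direction of the effect the contribution statement describes.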
Findings
Up to 46.2% improvement in reasoning tasks
Effective attenuation of low-probability token gradients
More balanced token updates in RL training
Abstract
Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates…
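The claim about gradient magnitudes can be checked in a simplified setting. The sketch below looks only at the final softmax layer: the gradient of log p[k] with respect to the logits is one_hot(k) - p, so its squared norm is 1 - 2 p[k] + ||p||^2, which shrinks as p[k] grows. This is a toy illustration on assumed logits, not the paper's full-network analysis; the helper names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def logprob_grad_norm(logits, token):
    """L2 norm of d log p[token] / d logits, which equals one_hot(token) - p."""
    p = softmax(logits)
    grad = -p            # negation returns a new array, so p is not modified
    grad[token] += 1.0
    return float(np.linalg.norm(grad))

# Toy logits (an assumption for illustration): token 0 is likely, token 3 is rare.
logits = np.array([4.0, 1.0, 0.0, -2.0])
probs = softmax(logits)
norms = [logprob_grad_norm(logits, k) for k in range(len(logits))]

for k, (p, g) in enumerate(zip(probs, norms)):
    print(f"token {k}: prob={p:.4f}  grad_norm={g:.4f}")
```

In this toy case the least likely token has the largest gradient norm with respect to the logits, and the most likely token has the smallest.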
Peer Reviews
Decision: ICLR 2026 Poster
I think this paper's main claim is new to the field: the gradients of low-probability tokens, because of their large magnitude, dictate the changes in the probability of high-probability tokens (rather than the advantages of those high-probability tokens doing so). The paper shows that mitigating this improves results on the K&K puzzle. I am not familiar with this puzzle, but it seems to be a challenging task. I think the K&K results provide the strongest supporting evidence.
The paper points out a very interesting observation. However, the math results are not consistent with the theory: their methods and GRPO achieve almost the same score. I think this is because they test in an R1-Zero-style scenario, where they start from a base model. Papers that do RL on base models show this quick recovery of some performance, after which the curves are flat. I think the paper should have been done on a native reasoning model on R1-Distill-1.5B as they show usually
1. The theoretical derivation and supporting experiments are well aligned; the motivation is lucid and convincing. 2. The proposed methods are low-cost, effective, and easy to implement in practice. 3. The authors provide careful experiments and analysis, including evaluations on multiple algorithms, multiple domains, and several ablations.
See questions below.
1. The research is built on a solid theoretical foundation, with Proposition 4.2 mathematically demonstrating that a token's gradient norm is inversely related to its probability. This elevates empirical observation to a predictable phenomenon. 2. The paper excels at communicating a complex technical subject. The narrative is logical, and the framing of the issue as the "tyranny of the unlikely token" is both memorable and effective.
1. The authors claim that low-probability tokens dominate model updates during RL training and that this dominance may impede the precise adjustment of the probability distribution across all tokens. In fact, high-entropy minority tokens drive effective reinforcement learning for LLM reasoning, so we do not need to adjust the probability distribution across all tokens with RL. 2. RL Algorithm Specificity: Experiments were conducted exclusively with GRPO. All experiments used models from t
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
