Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs

Zhihe Yang; Xufang Luo; Zilong Wang; Dongqi Han; Zhiyuan He; Dongsheng Li; Yunjian Xu

arXiv:2505.12929·cs.CL·May 20, 2025

TL;DR

This paper addresses the issue of low-probability tokens dominating gradients in RL training of LLMs, proposing methods to balance token influence and improve reasoning performance.

Contribution

It introduces Advantage Reweighting and Lopti, two techniques to mitigate low-probability token dominance in RL training of LLMs, enhancing learning efficiency.

Findings

01 Up to 46.2% improvement in reasoning tasks

02 Effective attenuation of low-probability token gradients

03 More balanced token updates in RL training

Abstract

Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates…
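The gradient-magnitude claim in the abstract can be illustrated with a toy softmax policy. The sketch below is not the authors' code: it only looks at the gradient of log π(token) with respect to the logits (which equals one-hot(token) − π), whereas the paper's analysis concerns the full network parameters. The logit values and the helper name `logit_grad_norm` are assumptions chosen for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def logit_grad_norm(logits, token):
    """L2 norm of d log pi(token) / d logits, i.e. ||onehot(token) - pi||."""
    pi = softmax(logits)
    grad = -pi            # -pi_j for every vocabulary entry j
    grad[token] += 1.0    # plus 1 at the sampled token
    return float(np.linalg.norm(grad))

# Hypothetical 4-token vocabulary: token 0 is likely, token 3 is unlikely.
logits = np.array([4.0, 1.0, 0.0, -1.0])
pi = softmax(logits)
norms = [logit_grad_norm(logits, t) for t in range(len(logits))]

for t in range(len(logits)):
    print(f"token {t}: prob={pi[t]:.3f}  grad_norm={norms[t]:.3f}")
```

In this toy setting the unlikely token yields a much larger gradient norm than the likely one, so with equal advantages its update would carry more weight. That is the imbalance the two proposed methods aim to attenuate.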

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

I think this paper's main claim is new to the field: the gradients of low-probability tokens, rather than the advantages of the high-probability tokens themselves, dictate the changes in the probability of high-probability tokens, because their gradient magnitude is large. The paper shows that mitigating this improves results on the K&K puzzle. I am not familiar with this puzzle, but it seems to be a challenging task. I think the K&K results provide the strongest supporting evidence.

Weaknesses

The paper points out a very interesting observation. However, the math results are not consistent with the theory: their methods and GRPO achieve almost the same score. I think this is because they test in an R1-Zero-style scenario where they start from a base model. Papers that do RL on base models show this quick recovery of some performance, after which the curves are flat. I think the paper should have been done on a native reasoning model on R1-Distill-1.5B as they show usually

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. The theoretical derivation and supporting experiments are well aligned; the motivation is lucid and convincing.
2. The proposed methods are low-cost, effective, and easy to implement in practice.
3. The authors provide careful experiments and analysis, including evaluations on multiple algorithms, multiple domains, and several ablations.

Weaknesses

See questions below.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The research is built on a solid theoretical foundation, with Proposition 4.2 mathematically demonstrating that a token's gradient norm is inversely related to its probability. This elevates an empirical observation to a predictable phenomenon.
2. The paper excels at communicating a complex technical subject. The narrative is logical, and the framing of the issue as the "tyranny of the unlikely token" is both memorable and effective.

Weaknesses

1. The authors claim that low-probability tokens dominate model updates during RL training and that this dominance may impede the precise adjustment of the probability distribution across all tokens. In fact, high-entropy minority tokens drive effective reinforcement learning for LLM reasoning, so we do not need to adjust the probability distribution across all tokens with RL.
2. RL Algorithm Specificity: Experiments were conducted exclusively with GRPO. All experiments used models from t

Code & Models

Repositories

zhyang2226/ar-lopti
PyTorch · Official

Datasets

happynew111/haotian_data-GPS-AR-Lopti-master
dataset · 55 dl

Taxonomy

Topics

Topic Modeling · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics