Dynamic Vocabulary Pruning: Stable LLM-RL by Taming the Tail
Yingru Li, Jiawei Xu, Jiacai Liu, Yuxuan Tong, Ziniu Li, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang

TL;DR
This paper introduces Dynamic Vocabulary Pruning (DVP), which stabilizes reinforcement learning for large language models by excluding low-probability tail tokens, where the training-inference mismatch in log-probabilities is largest and biases gradient estimates.
Contribution
The paper proposes DVP, an approach that dynamically prunes tail tokens from the vocabulary to improve RL stability in LLMs, and supports it with theoretical bounds on the pruning bias and empirical validation.
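To make "pruning the tail" concrete, here is a minimal sketch of one generic way to drop tail tokens from a next-token distribution. The rule used (keep tokens whose probability is at least a fraction of the top token's probability, then renormalize) and the names `prune_tail` and `rel_threshold` are assumptions chosen for illustration. This summary does not specify the paper's actual pruning rule, so this should not be read as the authors' implementation.

```python
import numpy as np

def prune_tail(logits, rel_threshold=0.1):
    """Zero out tail tokens and renormalize the remaining probabilities.

    Hypothetical rule: a token is kept if its probability is at least
    rel_threshold times the probability of the most likely token.
    """
    z = logits - logits.max()            # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()  # softmax
    keep = probs >= rel_threshold * probs.max()
    pruned = np.where(keep, probs, 0.0)
    return pruned / pruned.sum(), keep

logits = np.array([4.0, 3.0, 1.0, -2.0, -5.0])
pruned, keep = prune_tail(logits)
print("kept tokens:", keep)
print("pruned distribution:", np.round(pruned, 4))
```

Because the kept set depends on the current distribution, it changes from step to step, which is one sense in which such a rule is dynamic. A peaked distribution keeps few tokens and a flat one keeps many.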
Findings
DVP stabilizes RL training for LLMs.
The paper derives theoretical bounds on the bias introduced by pruning.
Empirical results show improved training stability.
Abstract
Reinforcement Learning (RL) for Large Language Models (LLMs) faces a fundamental tension: the numerical divergence between high-throughput inference engines and numerically precise training engines. Although these systems share the same parameters, they produce slightly different probability distributions, creating a training-inference mismatch. We prove that the bound on the log-probability divergence arising from this mismatch scales as $(1 - p)$, where $p$ is the token probability. This scaling induces a highly asymmetric effect: the bound vanishes for high-probability tokens but remains significant for low-probability tokens in the distribution tail. When sampled, these tail tokens introduce systematically biased errors that accumulate over sequences, thereby destabilizing gradient estimation. Instead of applying post-hoc corrections, we propose Dynamic Vocabulary Pruning (DVP), which…
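The asymmetry between high- and low-probability tokens can be checked numerically. The sketch below assumes the mismatch can be modeled as a perturbation of the logits with each entry bounded by `eps` in absolute value. This is a simplifying assumption made here, not necessarily the paper's setup. Under that assumption, softmax algebra gives |Δ log p_i| ≤ (1 − p_i)(e^{2·eps} − 1), so the bound shrinks toward zero as p_i approaches 1. The function names and the example logits are illustrative.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()                      # shift for numerical stability
    return z - np.log(np.exp(z).sum())

def logprob_shift(logits, eps, rng):
    """Return (p, |delta log p|) for a random logit perturbation in [-eps, eps]."""
    noise = rng.uniform(-eps, eps, size=logits.shape)
    base = log_softmax(logits)
    shifted = log_softmax(logits + noise)
    return np.exp(base), np.abs(shifted - base)

rng = np.random.default_rng(0)
logits = np.array([8.0, 2.0, 1.0, 0.0, -1.0, -3.0])  # one dominant token, five tail tokens
eps = 1e-2                                           # assumed per-logit mismatch size

p, delta = logprob_shift(logits, eps, rng)
bound = (1.0 - p) * np.expm1(2 * eps)                # (1 - p) * (e^{2 eps} - 1)

for pi, di, bi in zip(p, delta, bound):
    print(f"p={pi:.5f}  |dlogp|={di:.2e}  bound={bi:.2e}")
```

The dominant token gets a bound far smaller than any tail token, which matches the abstract's claim that the mismatch bound vanishes for high-probability tokens and stays significant in the tail.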
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms
