QuRL: Efficient Reinforcement Learning with Quantized Rollout
Yuhang Li, Reena Elangovan, Xin Dong, Priyadarshini Panda, Brucek Khailany

TL;DR
QuRL introduces a quantized actor to accelerate reinforcement learning rollouts in large language models, employing adaptive clipping and invariant scaling to maintain training stability and achieve significant speedups.
Contribution
The paper presents QuRL, which runs RL rollouts for LLMs on a quantized actor and uses adaptive clipping and invariant scaling to keep training stable.
Findings
Achieves 20% to 80% faster rollout during training.
Effective mitigation of quantization noise and training collapse.
Demonstrates benefits with INT8 and FP8 quantization.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a trending paradigm for training reasoning large language models (LLMs). However, due to the autoregressive decoding nature of LLMs, the rollout process becomes the efficiency bottleneck of RL training, accounting for up to 70% of the total training time. In this work, we propose Quantized Reinforcement Learning (QuRL), which uses a quantized actor to accelerate the rollout. We address two challenges in QuRL. First, we propose Adaptive Clipping Range (ACR), which dynamically adjusts the clipping ratio based on the policy ratio between the full-precision actor and the quantized actor and is essential for mitigating long-term training collapse. Second, we identify the weight update problem, where weight changes between RL steps are extremely small, making it difficult for the quantization operation to capture them…
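The weight update problem described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: it assumes a generic symmetric per-tensor INT8 round-to-nearest quantizer, and the weight scale (standard deviation 0.02) and per-step update size (about 1e-6) are hypothetical values chosen only to show the effect. The helper names `quantize_int8` and `fraction_changed` are made up for this example.

```python
import random

def quantize_int8(weights):
    """Symmetric per-tensor INT8 round-to-nearest quantization (assumed, generic scheme)."""
    scale = max(abs(w) for w in weights) / 127.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def fraction_changed(before, after):
    """Share of positions whose integer code differs between two quantized tensors."""
    return sum(b != a for b, a in zip(before, after)) / len(before)

random.seed(0)

# Hypothetical full-precision weights of one layer.
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

# Hypothetical RL step: every weight moves by roughly 1e-6.
updated = [w + random.gauss(0.0, 1e-6) for w in weights]

codes_before, scale_before = quantize_int8(weights)
codes_after, scale_after = quantize_int8(updated)
changed = fraction_changed(codes_before, codes_after)

print(f"quantization step size: {scale_before:.2e}")
print(f"fraction of INT8 codes changed by the update: {changed:.4%}")
```

Under these assumptions the quantization step is orders of magnitude larger than the update, so almost every weight rounds to the same integer code before and after the step. The quantized actor therefore barely sees the change made to the full-precision weights.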
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
