AIS: Adaptive Importance Sampling for Quantized RL
Jiajun Zhou, Wei Shao, Lingchao Zheng, Yuwei Fan, Ngai Wong

TL;DR
This paper introduces Adaptive Importance Sampling (AIS), a correction framework for quantized reinforcement learning that balances exploration and stability, improving training efficiency for large language models.
Contribution
AIS dynamically adjusts gradient correction during training, mitigating bias from low-precision rollouts while maintaining exploration benefits.
Findings
AIS matches BF16 baseline performance on most tasks.
AIS retains the 1.5x to 2.76x rollout speedup of FP8.
AIS improves training stability on reasoning benchmarks.
Abstract
Reinforcement learning (RL) for large language models (LLMs) is dominated by the cost of rollout generation, which has motivated the use of low-precision rollouts (e.g., FP8) paired with a BF16 trainer to improve throughput and reduce memory pressure. This introduces a rollout-training mismatch that biases the policy gradient and can cause training to collapse outright on reasoning benchmarks. We show that the mismatch is non-stationary and acts as a double-edged sword: early in training it provides a stochastic exploration bonus, exposing the gradient to trajectories the trainer would otherwise under-sample, but the same perturbation transitions into a destabilizing source of bias as the policy concentrates. To solve this, we propose Adaptive Importance Sampling (AIS), a correction framework that adjusts the strength of its intervention on a per-batch basis. AIS combines three…
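To make the rollout-training mismatch concrete, the sketch below shows a generic truncated importance-sampling correction between a low-precision rollout policy and a higher-precision trainer. It is not the AIS method, whose components are elided in the abstract above; it only illustrates the kind of per-token reweighting that such a correction framework could adjust. The function names, the `clip_max` threshold, and the sample numbers are all hypothetical.

```python
import math


def truncated_is_weights(trainer_logprobs, rollout_logprobs, clip_max=2.0):
    """Per-token importance weights min(pi_trainer / pi_rollout, clip_max).

    trainer_logprobs: log-probs of the sampled tokens under the trainer (e.g. BF16).
    rollout_logprobs: log-probs of the same tokens under the rollout engine (e.g. FP8).
    clip_max: hypothetical truncation threshold that bounds the variance of the weights.
    """
    if len(trainer_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [
        min(math.exp(t - r), clip_max)
        for t, r in zip(trainer_logprobs, rollout_logprobs)
    ]


def corrected_pg_loss(trainer_logprobs, rollout_logprobs, advantages, clip_max=2.0):
    """Mean policy-gradient surrogate loss with truncated importance weights.

    The weights are treated as constants: in an autodiff framework they would be
    detached so that gradients flow only through trainer_logprobs.
    """
    weights = truncated_is_weights(trainer_logprobs, rollout_logprobs, clip_max)
    terms = [
        -w * a * lp
        for w, a, lp in zip(weights, advantages, trainer_logprobs)
    ]
    return sum(terms) / len(terms)


# Illustrative numbers only: three tokens with no, large, and negative mismatch.
trainer_lp = [-1.0, -0.5, -2.0]
rollout_lp = [-1.0, -3.0, -1.0]
adv = [1.0, 1.0, 1.0]

weights = truncated_is_weights(trainer_lp, rollout_lp)
loss = corrected_pg_loss(trainer_lp, rollout_lp, adv)
```

With a fixed `clip_max`, the correction strength is the same at every stage of training. The abstract's observation that the mismatch is non-stationary is what motivates adjusting that strength per batch instead.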
