RLAX: Large-Scale, Distributed Reinforcement Learning for Large Language Models on TPUs

Runlong Zhou; Lefan Zhang; Shang-Chen Wu; Kelvin Zou; Hanzhi Zhou; Ke Ye; Yihao Feng; Dong Yin; Alex Guillen Garcia; Dmytro Babych; Rohit Chatterjee; Matthew Hopkins; Xiang Kong; Chang Lan; Lezhi Li; Yiping Ma; Daniele Molinari; Senyu Tong; Yanchao Sun; Thomas Voice; Jianyu Wang; Chong Wang; Simon Wang; Floris Weers; Yechen Xu; Guolin Yin; Muyang Yu; Yi Zhang; Zheng Zhou; Danyang Zhuo; Ruoming Pang; Cheng Leong

arXiv:2512.06392·cs.LG·December 12, 2025

TL;DR

RLAX is a scalable, distributed reinforcement learning framework for TPUs that improves the reasoning ability of large language models through system techniques and dataset curation. It raises QwQ-32B's pass@8 accuracy by 12.8% in 12 hours 48 minutes on 1024 v5p TPUs.

Contribution

Developed RLAX, a novel scalable RL framework on TPUs with system optimizations and data curation methods for large language models.

Findings

01
RLAX improves QwQ-32B's pass@8 accuracy by 12.8% in 12 hours 48 minutes on 1024 v5p TPUs.

02
Training scales across a fleet of inference workers and remains robust to preemptions.

03
The system techniques support a diverse set of state-of-the-art RL algorithms on TPU infrastructure.
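The headline finding is stated in terms of pass@8. This page does not say how the metric was computed, so the following is only a sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The function name `pass_at_k` and the sample counts are illustrative assumptions, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples generated for the problem (assumed setup)
    c: number of those samples that are correct
    k: budget of samples considered, e.g. k=8 for pass@8
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are incorrect, every size-k subset
    # must contain at least one correct sample.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 16 samples, 4 correct, evaluated at k=8.
score = pass_at_k(n=16, c=4, k=8)
```

A benchmark-level pass@8 would then be the mean of this value over all problems; whether the reported 12.8% gain is measured this way is not stated here.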

Abstract

Reinforcement learning (RL) has emerged as the de facto paradigm for improving the reasoning capabilities of large language models (LLMs). We have developed RLAX, a scalable RL framework on TPUs. RLAX employs a parameter-server architecture: a master trainer periodically pushes updated model weights to the parameter server, while a fleet of inference workers pulls the latest weights and generates new rollouts. We introduce a suite of system techniques to enable scalable and preemptible RL for a diverse set of state-of-the-art RL algorithms. To accelerate convergence and improve model quality, we have devised new dataset curation and alignment techniques. Large-scale evaluations show that RLAX improves QwQ-32B's pass@8 accuracy by 12.8% in just 12 hours 48 minutes on 1024 v5p TPUs, while remaining robust to preemptions during training.
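The push/pull pattern described in the abstract can be illustrated with a toy, single-process sketch. Everything here is an assumption made for illustration: the class names `ParameterServer` and `InferenceWorker`, the version counter, and the dictionary standing in for model weights are hypothetical and do not reflect RLAX's actual implementation or API.

```python
import threading
from typing import Dict, Tuple


class ParameterServer:
    """Hypothetical store for the latest weights and a version counter."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._weights: Dict[str, float] = {}
        self._version = 0

    def push(self, weights: Dict[str, float]) -> int:
        # Called by the trainer after an update; returns the new version.
        with self._lock:
            self._weights = dict(weights)
            self._version += 1
            return self._version

    def pull(self) -> Tuple[int, Dict[str, float]]:
        # Called by inference workers; returns a copy so callers
        # cannot mutate the server's state.
        with self._lock:
            return self._version, dict(self._weights)


class InferenceWorker:
    """Hypothetical worker that refreshes weights before each rollout."""

    def __init__(self, server: ParameterServer) -> None:
        self.server = server
        self.version = -1
        self.weights: Dict[str, float] = {}

    def generate_rollout(self, prompt: str) -> Dict[str, object]:
        version, weights = self.server.pull()
        if version != self.version:
            # Newer weights are available; swap them in.
            self.version, self.weights = version, weights
        # Placeholder for real sampling: record which policy version
        # produced the rollout so the trainer can account for staleness.
        return {"prompt": prompt, "policy_version": self.version}


server = ParameterServer()
worker = InferenceWorker(server)

server.push({"w": 0.0})                      # trainer publishes version 1
first = worker.generate_rollout("prompt A")

server.push({"w": 0.1})                      # trainer publishes version 2
second = worker.generate_rollout("prompt B")
```

Tagging each rollout with the policy version that produced it is one plausible design choice in such a setup: because workers and trainer are decoupled, a worker that is preempted and restarted can simply pull the current weights again without coordinating with the trainer.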

Taxonomy

Topics: Topic Modeling · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications