AReaL-Hex: Accommodating Asynchronous RL Training over Heterogeneous GPUs
Ran Yan, Youhe Jiang, Tianyuan Wu, Jiaxuan Gao, Zhiyu Mei, Wei Fu, Haohui Mai, Wei Wang, Yi Wu, Binhang Yuan

TL;DR
AReaL-Hex is a heterogeneity-aware asynchronous RL training system that schedules the stages of RL training across heterogeneous GPUs, raising training throughput by up to 1.50x and cutting training cost by up to 1.46x for large language models.
Contribution
It introduces a scheduling framework that combines mixed-integer linear programming (MILP) with graph partitioning to use heterogeneous GPUs efficiently in RL training.
Findings
Up to 1.50x higher training throughput compared to homogeneous systems.
Up to 1.46x reduction in training cost at the same throughput.
Effective mapping of I/O-bound and compute-bound tasks to cost-efficient resources.
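Illustrative sketch
The idea of mapping stages to the GPU type that suits them can be shown with a toy example. This is not the paper's scheduler: it swaps the MILP and graph-partitioning formulation for plain brute-force enumeration over a two-stage pipeline, and every name and number in it (`GPU_TYPES`, `io_rich`, `compute_rich`, the per-GPU throughputs, `best_split`) is a hypothetical assumption made for illustration. It assumes the pipeline's throughput is bounded by its slower stage.

```python
from itertools import product

# Hypothetical GPU pools. Throughputs are made-up samples/s per GPU for each
# stage; they are NOT measurements from the paper.
GPU_TYPES = {
    "compute_rich": {"count": 4, "rollout": 10.0, "train": 12.0},
    "io_rich": {"count": 8, "rollout": 8.0, "train": 3.0},
}


def best_split(gpu_types):
    """Enumerate every way to split each pool between rollout and training.

    Returns (throughput, allocation), where allocation maps a GPU type to the
    number of its GPUs assigned to rollout; the rest are assigned to training.
    Throughput is modeled as the minimum of the two stage throughputs.
    """
    names = list(gpu_types)
    ranges = [range(gpu_types[n]["count"] + 1) for n in names]
    best = None
    for alloc in product(*ranges):
        rollout = sum(k * gpu_types[n]["rollout"] for n, k in zip(names, alloc))
        train = sum(
            (gpu_types[n]["count"] - k) * gpu_types[n]["train"]
            for n, k in zip(names, alloc)
        )
        throughput = min(rollout, train)  # the slower stage is the bottleneck
        if best is None or throughput > best[0]:
            best = (throughput, dict(zip(names, alloc)))
    return best


throughput, rollout_alloc = best_split(GPU_TYPES)
print(throughput, rollout_alloc)
```

With these assumed numbers, the search sends most of the `io_rich` pool to rollout and keeps the `compute_rich` pool on training. Enumeration only works at toy scale, which is one reason a real scheduler would turn to an optimization formulation such as the one the paper describes.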
Abstract
Maximizing training throughput and cost-efficiency of RL for LLMs is essential to democratize this advanced technique. One promising but challenging approach is to deploy such a computational workflow over heterogeneous GPUs. Unlike conventional large-scale LLM pretraining, RL training generally decomposes into three coupled stages, i.e., rollout generation, reward computation, and policy/value updates, which exhibit markedly different compute intensities, memory footprints, and communication patterns. Recent research shows that fully asynchronous RL training can disaggregate these stages across disjoint hardware pools without sacrificing training stability, creating a great opportunity for real-world heterogeneous deployment. To this end, we present AReaL-Hex, a heterogeneity-aware asynchronous RL training system that effectively schedules how to execute rollout generation and policy…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Graph Theory and Algorithms · Cloud Computing and Resource Management
