LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training

Bo Wu; Sid Wang; Yunhao Tang; Jia Ding; Eryk Helenowski; Liang Tan; Tengyu Xu; Tushar Gowda; Zhengxing Chen; Chen Zhu; Xiaocheng Tang; Yundi Qian; Beibei Zhu; Rui Hou

arXiv:2505.24034·cs.LG·June 3, 2025

TL;DR

LlamaRL is a scalable, asynchronous distributed RL framework built on PyTorch, enabling efficient large-scale LLM training with significant speed improvements over existing systems, especially for models with hundreds of billions of parameters.

Contribution

The paper introduces LlamaRL, a novel distributed asynchronous RL framework optimized for large-scale LLMs, featuring a single-controller architecture and theoretical efficiency guarantees.

Findings

1. Achieves up to 10.7x speed-up over DeepSpeed-Chat-like systems on a 405B-parameter policy model.

2. Supports policy models at 8B, 70B, and 405B parameters.

3. Efficiency gains over the baseline increase with model size.

Abstract

Reinforcement Learning (RL) has become the most effective post-training approach for improving the capabilities of Large Language Models (LLMs). In practice, because of the high demands on latency and memory, it is particularly challenging to develop an efficient RL framework that reliably manages policy models with hundreds to thousands of billions of parameters. In this paper, we present LlamaRL, a fully distributed, asynchronous RL framework optimized for efficient training of large-scale LLMs with various model sizes (8B, 70B, and 405B parameters) on GPU clusters ranging from a handful to thousands of devices. LlamaRL introduces a streamlined, single-controller architecture built entirely on native PyTorch, enabling modularity, ease of use, and seamless scalability to thousands of GPUs. We also provide a theoretical analysis of LlamaRL's efficiency, including a formal proof that…
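
To illustrate why an asynchronous design can help, the toy timing model below compares a synchronous RL step, where generation and training run back to back, with an overlapped step, where the two stages run concurrently on separate resources. This is an illustrative sketch and not the paper's analysis: the function names and per-step timings are hypothetical, and the model ignores communication, weight synchronization, and memory effects.

```python
def sync_step_time(t_generate: float, t_train: float) -> float:
    """Synchronous step: training waits for generation, so the times add."""
    return t_generate + t_train


def async_step_time(t_generate: float, t_train: float) -> float:
    """Overlapped step: both stages run concurrently (assumed to be on
    separate devices), so the slower stage sets the pace."""
    return max(t_generate, t_train)


def overlap_speedup(t_generate: float, t_train: float) -> float:
    """Ratio of synchronous to overlapped step time in this toy model."""
    if t_generate <= 0 or t_train <= 0:
        raise ValueError("stage times must be positive")
    return sync_step_time(t_generate, t_train) / async_step_time(t_generate, t_train)
```

In this simplified model the gain from overlap alone is at most 2x, reached when both stages take equal time, so overlap by itself does not account for the reported speed-up figures; it only shows the direction of the effect.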
