TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

Chenhao Ye; Huaizheng Zhang; Mingcong Han; Baoquan Zhong; Xiang Li; Qixiang Chen; Xinyi Zhang; Weidong Zhang; Kaihua Jiang; Wang Zhang; He Sun; Wencong Xiao; Andrea C. Arpaci-Dusseau; Remzi H. Arpaci-Dusseau

arXiv:2604.09107 · cs.DC · April 13, 2026

TL;DR

TensorHub is a weight transfer system for large language model (LLM) reinforcement learning that serves weight reads directly from the GPU workers already holding them, rather than storing separate copies, so that training can scale across heterogeneous resources.

Contribution

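The paper presents Reference-Oriented Storage (ROS), a storage abstraction that tracks which workers hold each version of the model weights instead of storing copies, and TensorHub, a production-quality system built on ROS. Together they aim to provide flexible weight transfer for RL training without the data movement overhead of existing approaches.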

Findings

01. TensorHub fully saturates RDMA bandwidth.

02. TensorHub reduces GPU stall time by up to 6.7x.

03. TensorHub accelerates weight updates by 4.8x and cuts cross-datacenter stall time by 19x.

Abstract

Modern LLM reinforcement learning (RL) workloads require a highly efficient weight transfer system to scale training across heterogeneous computational resources. However, existing weight transfer approaches either fail to provide flexibility for dynamically scaling clusters or incur fundamental data movement overhead, resulting in poor performance. We introduce Reference-Oriented Storage (ROS), a new storage abstraction for RL weight transfer that exploits the highly replicated model weights in place. ROS presents the illusion that certain versions of the model weights are stored and can be fetched on demand. Underneath, ROS does not physically store any copies of the weights; instead, it tracks the workers that hold these weights on GPUs for inference. Upon request, ROS directly uses them to serve reads. We build TensorHub, a production-quality system that extends the ROS idea with…
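
The bookkeeping idea described in the abstract can be illustrated with a minimal Python sketch. This is not TensorHub's implementation or API: the class `ReferenceStore`, its methods `publish`, `retire`, and `locate`, and the worker names are all hypothetical, and the actual data transfer is left out. The sketch assumes only what the abstract states, namely that the store keeps no copy of the weights, records which workers hold a given version, and answers a read by pointing at one of those workers.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceStore:
    """Toy registry (hypothetical): maps a weight version to the workers holding it."""

    # version number -> names of workers that currently hold that version on GPU
    holders: dict[int, set[str]] = field(default_factory=dict)

    def publish(self, version: int, worker: str) -> None:
        """Record that `worker` holds `version`; no weight data is copied."""
        self.holders.setdefault(version, set()).add(worker)

    def retire(self, version: int, worker: str) -> None:
        """Record that `worker` no longer holds `version`."""
        workers = self.holders.get(version, set())
        workers.discard(worker)
        if not workers:
            # Nobody holds this version any more, so it is no longer readable.
            self.holders.pop(version, None)

    def locate(self, version: int) -> str:
        """Return one worker that can serve a read of `version`."""
        workers = self.holders.get(version)
        if not workers:
            raise KeyError(f"no worker currently holds version {version}")
        # Deterministic choice for the sketch; a real system would pick
        # a source based on load, topology, or similar criteria.
        return min(workers)


store = ReferenceStore()
store.publish(3, "rollout-0")
store.publish(3, "rollout-1")
source = store.locate(3)        # a reader would fetch version 3 from this worker
store.retire(3, "rollout-0")    # rollout-0 leaves or moves to another version
fallback = store.locate(3)      # reads are now served by the remaining holder
```

One consequence visible even in this toy version is that a version stays readable only while at least one worker still holds it, which is why the store must track workers joining and leaving.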
