Thousand-GPU Large-Scale Training and Optimization Recipe for AI-Native Cloud Embodied Intelligence Infrastructure

Yongjian Guo; Yunxuan Ma; Haoran Sun; Zhong Guan; Shuai Di; Jing Long; Wanting Xu; Xiaodong Bai; Wen Huang; Yucheng Guo; Chen Zhou; Qiming Yang; Mingxi Luo; Tianyun Zhao; Hedan Yang; Song Wang; Xiaomeng Tian; Xiaolong Xiang; Zhen Sun; Yu Wei; Luqiao Wang; Yuzhen Li; Chenfeng Gu; Junwu Xiong; Yicheng Gong

arXiv:2603.11101 · cs.RO · March 20, 2026

TL;DR

This paper presents a cloud-based, thousand-GPU distributed training platform for embodied intelligence, built on the LeRobot framework. It reports cutting single-round GR00T-N1.5 training time from 15 hours to 22 minutes (roughly a 40-fold speedup) through optimizations at the data, training, and model layers.

Contribution

The paper introduces what the authors describe as the industry's first thousand-GPU training platform for embodied intelligence, with systematic fixes for bottlenecks across the pipeline and new speedup techniques.

Findings

01. 40-fold reduction in single-round training time for GR00T-N1.5 (15 hours to 22 minutes)

02. 188% speed increase from combining variable-length FlashAttention with Data Packing

03. End-to-end validation on thousand-GPU clusters

Abstract

Embodied intelligence is a key step towards Artificial General Intelligence (AGI), yet its development faces multiple challenges including data, frameworks, infrastructure, and evaluation systems. To address these issues, we have, for the first time in the industry, launched a cloud-based, thousand-GPU distributed training platform for embodied intelligence, built upon the widely adopted LeRobot framework, and have systematically overcome bottlenecks across the entire pipeline. At the data layer, we have restructured the data pipeline to optimize the flow of embodied training data. In terms of training, for the GR00T-N1.5 model, utilizing thousand-GPU clusters and data at the scale of hundreds of millions, the single-round training time has been reduced from 15 hours to just 22 minutes, achieving a 40-fold speedup. At the model layer, by combining variable-length FlashAttention and Data…
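The abstract names Data Packing and variable-length FlashAttention without showing how they fit together. The sketch below is an illustration of the general idea, not the authors' implementation: variable-length samples are packed into fixed token budgets instead of being padded to a common length, and each packed batch carries cumulative sequence offsets so an attention kernel can keep samples from attending to each other. The function names (`pack_sequences`, `cu_seqlens`, `slot_counts`), the greedy first-fit-decreasing strategy, and the example lengths are all assumptions made for this example.

```python
from itertools import accumulate


def pack_sequences(lengths, max_tokens):
    """Greedy first-fit-decreasing packing (an assumed strategy).

    Returns a list of bins; each bin is a list of sample indices whose
    total length does not exceed max_tokens.
    """
    bins = []  # each entry: [remaining_capacity, [sample indices]]
    # Place the longest samples first so short ones can fill the gaps.
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        n = lengths[idx]
        if n > max_tokens:
            raise ValueError(f"sample {idx} has {n} tokens, budget is {max_tokens}")
        for b in bins:
            if b[0] >= n:  # first bin with enough room
                b[0] -= n
                b[1].append(idx)
                break
        else:  # no existing bin fits: open a new one
            bins.append([max_tokens - n, [idx]])
    return [b[1] for b in bins]


def cu_seqlens(lengths):
    """Cumulative offsets [0, l0, l0+l1, ...] marking sample boundaries
    inside one packed bin, the layout variable-length attention kernels
    typically expect."""
    return [0] + list(accumulate(lengths))


def slot_counts(lengths, max_tokens):
    """Token slots used when padding every sample to max_tokens versus
    when packing samples into max_tokens-sized bins."""
    padded = len(lengths) * max_tokens
    packed = len(pack_sequences(lengths, max_tokens)) * max_tokens
    return padded, packed


if __name__ == "__main__":
    lengths = [5, 3, 7, 2, 4]  # hypothetical per-sample token counts
    budget = 8                 # hypothetical token budget per packed row
    for members in pack_sequences(lengths, budget):
        offsets = cu_seqlens([lengths[i] for i in members])
        print(members, offsets)
    print(slot_counts(lengths, budget))
```

In a real training loop the offsets would be passed to a variable-length attention kernel so that attention stays within each sample; the gain comes from spending fewer compute slots on padding tokens.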

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · IoT and Edge/Fog Computing