Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

Yi Wang; Xinchen Li; Pengwei Xie; Pu Yang; Buqing Nie; Yunuo Cai; Qinglin Zhang; Chendi Qu; Jeffrey Wu; Jianheng Song; Xinlin Ren; Jingshun Huang; Mingjie Pan; Siyuan Feng; Zhi Chen; Jianlan Luo

arXiv:2605.00416 · cs.RO · May 4, 2026

TL;DR

This paper introduces Learning While Deploying (LWD), a fleet-scale reinforcement learning framework that continually improves generalist robot policies through autonomous deployment, human intervention, and online learning across multiple real-world tasks.

Contribution

The paper presents a novel fleet-scale offline-to-online RL framework that enables continual policy improvement for generalist robots during deployment, combining value estimation and policy extraction techniques.

Findings

01 A single generalist policy achieved a 95% success rate across tasks.

02 LWD improved performance on long-horizon manipulation tasks.

03 Fleet experience led to continuous policy enhancement.

Abstract

Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation…
