Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos

Qixiu Li; Yu Deng; Yaobo Liang; Lin Luo; Lei Zhou; Chengtang Yao; Lingqi Zeng; Zhiyuan Feng; Huizhi Liang; Sicheng Xu; Yizhong Zhang; Xi Chen; Hao Chen; Lily Sun; Dong Chen; Jiaolong Yang; Baining Guo

arXiv:2510.21571 · cs.RO · October 27, 2025

TL;DR

This paper introduces a scalable pretraining approach for robotic vision-language-action models using large-scale, unannotated human activity videos, enabling robots to learn diverse manipulation skills with minimal supervision.

Contribution

It develops an automated method to convert egocentric human videos into aligned robotic VLA training data, significantly expanding available datasets for robotic manipulation learning.

Findings

01 The model exhibits strong zero-shot generalization to unseen real-world observations.

02 Fine-tuning on small robot datasets improves task success and object generalization.

03 Performance scales positively with increased pretraining data volume.

Abstract

This paper presents a novel approach for pretraining robotic manipulation Vision-Language-Action (VLA) models using a large corpus of unscripted real-life video recordings of human hand activities. Treating the human hand as a dexterous robot end-effector, we show that "in-the-wild" egocentric human videos without any annotations can be transformed into data formats fully aligned with existing robotic VLA training data in terms of task granularity and labels. This is achieved by developing a fully automated, holistic human activity analysis approach for arbitrary human hand videos. The approach generates atomic-level hand activity segments and their language descriptions, each accompanied by framewise 3D hand motion and camera motion. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. This training data…
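To make the data format described in the abstract more concrete, here is a minimal sketch of what one hand-VLA episode record might look like. The class names, field names, and the use of frame-to-frame wrist displacement as an action label are all assumptions made for illustration; the abstract only states that each segment has a language description plus framewise 3D hand motion and camera motion. The last line simply divides the reported totals (26M frames, 1M episodes) to get the implied average episode length.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class HandFrame:
    # Hypothetical fields: the paper's actual per-frame representation is not given here.
    wrist_position: Vec3      # 3D hand position for this frame (assumed parameterization)
    camera_translation: Vec3  # camera motion for this frame (assumed parameterization)


@dataclass
class HandEpisode:
    # One atomic-level hand activity segment.
    instruction: str          # language description of the segment
    frames: List[HandFrame]   # framewise 3D hand motion and camera motion


def wrist_deltas(episode: HandEpisode) -> List[Vec3]:
    """Frame-to-frame wrist displacement, used here as a stand-in action label (assumption)."""
    deltas = []
    for prev, cur in zip(episode.frames, episode.frames[1:]):
        deltas.append(tuple(c - p for c, p in zip(cur.wrist_position, prev.wrist_position)))
    return deltas


# Average episode length implied by the reported totals: 26M frames over 1M episodes.
AVG_FRAMES_PER_EPISODE = 26_000_000 / 1_000_000
```

Under this layout, an episode with N frames yields N - 1 displacement vectors, which mirrors how end-effector deltas are commonly stored in robot datasets; whether the authors use deltas, absolute poses, or a richer hand articulation is not stated in the abstract.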

Taxonomy

Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Multimodal Machine Learning Applications