Emergence of Human to Robot Transfer in Vision-Language-Action Models

Simar Kareer; Karl Pertsch; James Darpinian; Judy Hoffman; Danfei Xu; Sergey Levine; Chelsea Finn; Suraj Nair

arXiv:2512.22414 · cs.RO · December 30, 2025

Open Access

TL;DR

This paper asks whether vision-language-action (VLA) models can learn from human video as they scale, and reports that human-to-robot transfer emerges once the model is pre-trained on sufficiently diverse scenes, tasks, and embodiments.

Contribution

The paper introduces a simple co-training recipe and shows that human-to-robot transfer emerges once pretraining covers sufficiently diverse scenes, tasks, and embodiments.

Findings

01. Emergent human-to-robot transfer capability after large-scale pretraining.

02. Pretraining produces embodiment-agnostic representations for humans and robots.

03. Nearly doubled performance on generalization tasks with diverse robot pretraining.

Abstract

Vision-language-action (VLA) models can enable broad open world generalization, but require large and diverse datasets. It is appealing to consider whether some of this data can come from human videos, which cover diverse real-world situations and are easy to obtain. However, it is difficult to train VLAs with human videos alone, and establishing a mapping between humans and robots requires manual engineering and presents a major research challenge. Drawing inspiration from advances in large language models, where the ability to learn from diverse supervision emerges with scale, we ask whether a similar phenomenon holds for VLAs that incorporate human video data. We introduce a simple co-training recipe, and find that human-to-robot transfer emerges once the VLA is pre-trained on sufficient scenes, tasks, and embodiments. Our analysis suggests that this emergent capability arises…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Social Robot Interaction and HRI