ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow

Changhe Chen; Quantao Yang; Xiaohao Xu; Nima Fazeli; Olov Andersson

arXiv:2505.01288 · cs.RO · November 13, 2025

Open Access

TL;DR

ViSA-Flow introduces a self-supervised framework that learns semantic action flows from large-scale human videos and adapts this knowledge to robots, enabling efficient skill learning with minimal robot demonstrations.

Contribution

The paper presents a novel semantic action flow representation and a transfer learning framework that leverages large-scale human videos for robot skill acquisition.

Findings

01. State-of-the-art performance on CALVIN benchmark

02. Effective transfer from human videos to robots

03. Strong results in low-data regimes

Abstract

One of the central challenges preventing robots from acquiring complex manipulation skills is the prohibitive cost of collecting large-scale robot demonstrations. In contrast, humans are able to learn efficiently by watching others interact with their environment. To bridge this gap, we introduce semantic action flow as a core intermediate representation capturing the essential spatio-temporal manipulator-object interactions, invariant to superficial visual differences. We present ViSA-Flow, a framework that learns this representation self-supervised from unlabeled large-scale video data. First, a generative model is pre-trained on semantic action flows automatically extracted from large-scale human-object interaction video data, learning a robust prior over manipulation structure. Second, this prior is efficiently adapted to a target robot by fine-tuning on a small set of robot…

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Social Robot Interaction and HRI

Methods: Sparse Evolutionary Training