ConLA: Contrastive Latent Action Learning from Human Videos for Robotic Manipulation

Weisheng Dai; Kai Lan; Jianyi Zhou; Bo Zhao; Xiu Su; Junwen Tong; Weili Guan; Shuo Yang

arXiv:2602.00557 · cs.RO · February 3, 2026

TL;DR

ConLA is an unsupervised framework that learns disentangled, semantically meaningful latent actions from human videos, enabling scalable robotic policy pretraining that outperforms pretraining on real robot trajectories.

Contribution

Introduces ConLA, a contrastive disentanglement method that uses action priors and temporal cues to improve latent action learning from human videos for robotics.

Findings

01 ConLA outperforms previous methods on multiple benchmarks.

02 Pretraining on human videos alone surpasses real robot trajectory pretraining.

03 The method effectively isolates motion dynamics from visual content.

Abstract

Vision-Language-Action (VLA) models achieve preliminary generalization through pretraining on large-scale robot teleoperation datasets. However, acquiring datasets that comprehensively cover diverse tasks and environments is extremely costly and difficult to scale. In contrast, human demonstration videos offer a rich and scalable source of diverse scenes and manipulation behaviors, yet their lack of explicit action supervision hinders direct utilization. Prior work leverages VQ-VAE-based frameworks to learn latent actions from human videos in an unsupervised manner. Nevertheless, since the training objective primarily focuses on reconstructing visual appearances rather than capturing inter-frame dynamics, the learned representations tend to rely on spurious visual cues, leading to shortcut learning and entangled latent representations that hinder transferability. To address this, we…
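For readers unfamiliar with the VQ-VAE-based latent action learning that the abstract attributes to prior work, the sketch below illustrates only the generic vector-quantization step: an encoder output for a pair of frames is snapped to its nearest codebook entry, and that entry's index serves as a discrete latent action. This is a minimal illustration under stated assumptions, not ConLA's method or any specific prior implementation. The function names, the 2-D codebook, and the example encoder output are all hypothetical.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent: Sequence[float],
             codebook: List[Sequence[float]]) -> Tuple[int, Sequence[float]]:
    """Return the index and vector of the codebook entry nearest to `latent`."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(latent, codebook[i]))
    return index, codebook[index]


# Hypothetical 2-D codebook holding three discrete latent "actions".
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Hypothetical encoder output for a pair of video frames (assumed values).
index, code = quantize((0.9, 0.2), CODEBOOK)
print(index, code)  # prints: 1 (1.0, 0.0)
```

In a full VQ-VAE pipeline, a decoder would then reconstruct the later frame from the earlier frame plus the selected code. That reconstruction objective is what the abstract identifies as encouraging reliance on visual appearance rather than inter-frame dynamics.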

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Human Pose and Action Recognition