ConLA: Contrastive Latent Action Learning from Human Videos for Robotic Manipulation
Weisheng Dai, Kai Lan, Jianyi Zhou, Bo Zhao, Xiu Su, Junwen Tong, Weili Guan, Shuo Yang

TL;DR
ConLA is an unsupervised framework that learns disentangled, semantically meaningful latent actions from human videos, enabling scalable robot policy pretraining that outperforms pretraining on real robot trajectories.
Contribution
The paper introduces ConLA, a contrastive disentanglement method that uses action priors and temporal cues to improve latent action learning from human videos for robotics.
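To make the term "contrastive" concrete, here is a minimal sketch of a generic InfoNCE-style loss in plain Python. It is not ConLA's actual objective, which this summary does not specify. The function names, the temperature value, and the toy vectors are all illustrative assumptions. The idea shown is only that an anchor embedding is pulled toward a positive (for example, a latent action from a similar motion) and pushed away from negatives.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (hypothetical stand-in, not the paper's loss).

    Returns -log softmax probability of the positive among all candidates.
    """
    a = normalize(anchor)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(a, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidate logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# Toy embeddings (assumed): the positive points in nearly the same direction as the anchor.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

loss_aligned = info_nce(anchor, positive, negatives)
# Swap roles so an orthogonal vector is treated as the "positive".
loss_misaligned = info_nce(anchor, negatives[0], [positive, negatives[1]])
```

In this toy setup the loss is small when the designated positive is the vector closest to the anchor and large when it is not, which is the behavior a contrastive objective relies on.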
Findings
ConLA outperforms previous methods on multiple benchmarks.
Pretraining on human videos alone surpasses real robot trajectory pretraining.
The method effectively isolates motion dynamics from visual content.
Abstract
Vision-Language-Action (VLA) models achieve preliminary generalization through pretraining on large-scale robot teleoperation datasets. However, acquiring datasets that comprehensively cover diverse tasks and environments is extremely costly and difficult to scale. In contrast, human demonstration videos offer a rich and scalable source of diverse scenes and manipulation behaviors, yet their lack of explicit action supervision hinders direct utilization. Prior work leverages VQ-VAE-based frameworks to learn latent actions from human videos in an unsupervised manner. Nevertheless, since the training objective primarily focuses on reconstructing visual appearances rather than capturing inter-frame dynamics, the learned representations tend to rely on spurious visual cues, leading to shortcut learning and entangled latent representations that hinder transferability. To address this, we…
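As an illustration of the VQ-VAE-style bottleneck the abstract attributes to prior work, the sketch below maps a pair of frame features to a discrete latent action by nearest-codebook lookup. Everything here is an assumption made for illustration: the feature-difference "encoder", the three-entry codebook, and the function names are hypothetical and are not taken from ConLA or any specific prior method.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(z, codebook):
    """Vector quantization step: return the index and entry of the nearest code."""
    index = min(range(len(codebook)), key=lambda i: squared_distance(z, codebook[i]))
    return index, codebook[index]


def latent_action(feat_t, feat_next, codebook):
    """Map two consecutive frame features to a discrete latent action.

    The encoder is a stand-in (assumed): it simply takes the feature difference.
    A real model would use a learned network here.
    """
    z = [b - a for a, b in zip(feat_t, feat_next)]
    return quantize(z, codebook)


# Hypothetical codebook with three codes: "no motion", "move along x", "move along y".
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Toy features for frames t and t+1; the change is mostly along the first axis.
index, code = latent_action([0.2, 0.5], [1.1, 0.6], codebook)
```

Here the feature change is closest to the second codebook entry, so that code serves as the discrete latent action for this frame pair. The abstract's concern is that when such codes are trained mainly to reconstruct appearance, they can encode visual content instead of motion.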
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Human Pose and Action Recognition
