CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos

Chubin Zhang; Jianan Wang; Zifeng Gao; Yue Su; Tianru Dai; Cai Zhou; Jiwen Lu; Yansong Tang

arXiv:2601.04061 · cs.RO · January 8, 2026

Open Access

TL;DR

CLAP uses contrastive learning to align visual latents from human videos with proprioceptive latents from robot trajectories, so that manipulation skills observed in video map onto executable robot actions. Two policies build on this representation: CLAP-NTP for instruction following and object generalization, and CLAP-RF for high-frequency, precise manipulation.

Contribution

The paper presents CLAP, a novel contrastive pretraining method that aligns visual and proprioceptive latent spaces for improved robotic skill transfer from human videos.

Findings

01

CLAP outperforms baselines in transferring skills from videos to robots.

02

The CLAP-NTP model excels in instruction following and object generalization.

03

CLAP-RF achieves high-frequency, precise robotic manipulation.

Abstract

Generalist Vision-Language-Action models are currently hindered by the scarcity of robotic data compared to the abundance of human video demonstrations. Existing Latent Action Models attempt to leverage video data but often suffer from visual entanglement, capturing noise rather than manipulation skills. To address this, we propose Contrastive Latent Action Pretraining (CLAP), a framework that aligns the visual latent space from videos with a proprioceptive latent space from robot trajectories. By employing contrastive learning, CLAP maps video transitions onto a quantized, physically executable codebook. Building on this representation, we introduce a dual-formulation VLA framework offering both CLAP-NTP, an autoregressive model excelling at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy designed for high-frequency, precise manipulation.…
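
The abstract names two ingredients, contrastive alignment between video and proprioceptive latents and a quantized codebook, without giving the loss or the quantizer. As a rough illustration only, the sketch below assumes a symmetric InfoNCE loss and a nearest-neighbour codebook lookup; the paper may use different formulations. Every name here (`symmetric_info_nce`, `quantize`, the temperature, the embedding sizes) is hypothetical and not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(z_video, z_proprio, temperature=0.1):
    """Assumed loss: row i of z_video and row i of z_proprio form a positive pair;
    all other rows in the batch act as negatives, in both directions."""
    logits = l2_normalize(z_video) @ l2_normalize(z_proprio).T / temperature
    idx = np.arange(logits.shape[0])
    loss_v2p = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> proprio
    loss_p2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # proprio -> video
    return 0.5 * (loss_v2p + loss_p2v)


def quantize(z, codebook):
    """Assumed quantizer: snap each latent to its nearest codebook entry (squared L2)."""
    dist2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dist2.argmin(axis=1)
    return codes, codebook[codes]


# Synthetic data standing in for encoder outputs (no real encoders are involved).
rng = np.random.default_rng(0)
z_proprio = rng.normal(size=(8, 16))
z_video_aligned = z_proprio + 0.01 * rng.normal(size=(8, 16))
z_video_mismatched = np.roll(z_video_aligned, 1, axis=0)  # every pair is wrong

loss_aligned = symmetric_info_nce(z_video_aligned, z_proprio)
loss_mismatched = symmetric_info_nce(z_video_mismatched, z_proprio)

codebook = rng.normal(size=(32, 16))  # hypothetical 32-entry action codebook
codes, z_quantized = quantize(z_video_aligned, codebook)
```

In this toy setup the loss is lower when each video latent is paired with its own proprioceptive latent than when the pairs are shifted, which is the property a contrastive objective is meant to reward. The discrete `codes` are the kind of token sequence an autoregressive policy could predict, though how CLAP-NTP and CLAP-RF actually consume the codebook is not specified in the abstract.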

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning