CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

Jiange Yang; Yansong Shi; Haoyi Zhu; Mingyu Liu; Kaijing Ma; Yating Wang; Gangshan Wu; Tong He; Limin Wang

arXiv:2505.17006 · cs.CV · March 30, 2026

TL;DR

CoMo is an unsupervised learning framework that extracts continuous latent motion from internet videos. It steers the learned representation toward foreground dynamics rather than static backgrounds, and it supports zero-shot generalization to unseen videos.

Contribution

The paper proposes CoMo, which combines an early temporal difference (Td) mechanism with contrastive learning to capture continuous motion more precisely and to improve zero-shot policy transfer in robot learning.

Findings

01. CoMo achieves superior motion representation quality.

02. Policies trained with CoMo pseudo-labels outperform baselines.

03. CoMo demonstrates strong zero-shot generalization on unseen videos.

Abstract

Unsupervised learning of latent motion from Internet videos is crucial for robot learning. Existing discrete methods generally mitigate the shortcut learning caused by extracting excessive static backgrounds through vector quantization with a small codebook size. However, they suffer from information loss and struggle to capture more complex and fine-grained dynamics. Moreover, there is an inherent gap between the distribution of discrete latent motion and continuous robot action, which hinders the joint learning of a unified policy. We propose CoMo, which aims to learn more precise continuous latent motion from internet-scale videos. CoMo employs an early temporal difference (Td) mechanism to increase the shortcut learning difficulty and explicitly enhance motion cues. Additionally, to ensure latent motion better captures meaningful foregrounds, we further propose a temporal…
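As a rough illustration of two ideas the abstract mentions, the sketch below shows (a) how a temporal difference between consecutive frames cancels a static background and leaves only the moving region, and (b) how vector quantization with a small codebook discards part of a continuous latent. This is not the CoMo implementation: the function names `temporal_difference` and `quantize`, the toy 4x4 frames, and the three-entry codebook are all assumptions made for this example.

```python
import numpy as np


def temporal_difference(frame_t, frame_next):
    """Pixel-wise difference of two frames (hypothetical helper).

    Pixels that do not change between frames, such as a static
    background, cancel to zero, so only motion remains.
    """
    return frame_next.astype(np.float32) - frame_t.astype(np.float32)


def quantize(latent, codebook):
    """Nearest-neighbour vector quantization (hypothetical helper).

    Returns the index of the closest codebook entry and the entry itself.
    """
    dists = np.linalg.norm(codebook - latent, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]


# Toy frames: a uniform background with one bright pixel that moves right.
background = np.full((4, 4), 100, dtype=np.uint8)
frame_t = background.copy()
frame_next = background.copy()
frame_t[1, 1] = 200      # object position at time t
frame_next[1, 2] = 200   # object position at time t + 1

diff = temporal_difference(frame_t, frame_next)
moving_pixels = int(np.count_nonzero(diff))  # background contributes nothing

# Toy continuous latent and a deliberately small codebook.
latent = np.array([0.6, 0.3])
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
idx, code = quantize(latent, codebook)

# The residual is the information a discrete code cannot represent.
quantization_error = float(np.linalg.norm(latent - code))
```

In this toy setting the difference image is nonzero only at the two pixels the object leaves and enters, and the quantized code differs from the continuous latent by a nonzero residual. A continuous representation keeps that residual, which is the kind of fine-grained information the abstract says discrete methods lose.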
