LatBot: Distilling Universal Latent Actions for Vision-Language-Action Models
Zuolei Li, Xingyu Gao, Xiaofan Wang, Jianlong Fu

TL;DR
This paper introduces LatBot, a framework that learns universal latent actions from videos. These latent actions capture physical priors, which improves robot manipulation and enables effective transfer to new tasks with minimal data.
Contribution
The paper proposes a universal latent action learning method that adds action prediction as a training objective and decomposes actions into motion and scene tokens for better transferability.
Findings
Achieves strong performance on simulated and real-world tasks.
Successfully completes five tasks with only 10 real-world trajectories.
Demonstrates effective few-shot transfer in robotic manipulation.
Abstract
Learning transferable latent actions from large-scale object manipulation videos can significantly enhance generalization in downstream robotics tasks, as such representations are agnostic to different robot embodiments. Existing approaches primarily rely on visual reconstruction objectives while neglecting physical priors, leading to sub-optimal performance in learning universal representations. To address these challenges, we propose a Universal Latent Action Learning framework that takes task instructions and multiple frames as inputs, and optimizes both future frame reconstruction and action sequence prediction. Unlike prior works, incorporating action predictions (e.g., gripper or hand trajectories and orientations) allows the model to capture richer physical priors such as real-world distances and orientations, thereby enabling seamless transferability to downstream tasks. We…
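The abstract describes training against two objectives at once: future frame reconstruction and action sequence prediction. The sketch below illustrates one common way to combine two such terms, as a weighted sum of mean squared errors. It is an illustration only, not the authors' loss: the use of MSE, the flat-vector inputs, and the names `mse`, `joint_loss`, and `action_weight` are all assumptions introduced here.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty flat vectors."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal in length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    pred_frame: Sequence[float],
    target_frame: Sequence[float],
    pred_actions: Sequence[float],
    target_actions: Sequence[float],
    action_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Weighted sum of a frame reconstruction term and an action prediction term."""
    reconstruction_term = mse(pred_frame, target_frame)
    action_term = mse(pred_actions, target_actions)
    return reconstruction_term + action_weight * action_term


# Toy values: two "pixels" and a two-dimensional action vector.
loss = joint_loss([0.0, 1.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0], action_weight=0.5)
print(loss)
```

In this toy setup, raising `action_weight` makes action prediction error count for more relative to reconstruction error. How the two terms are actually balanced in LatBot is not stated in the text above.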
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
