LatBot: Distilling Universal Latent Actions for Vision-Language-Action Models

Zuolei Li; Xingyu Gao; Xiaofan Wang; Jianlong Fu

arXiv:2511.23034·cs.RO·December 1, 2025

Open Access

TL;DR

This paper introduces LatBot, a framework for learning universal latent actions from videos that improve robot manipulation by capturing physical priors and enabling effective transfer to new tasks with minimal data.

Contribution

The paper proposes a universal latent action learning method that adds action prediction to the training objective and decomposes latent actions into motion and scene tokens for better transferability.

Findings

01. Achieves strong performance on simulated and real-world tasks.

02. Successfully completes five tasks with only 10 real-world trajectories.

03. Demonstrates effective few-shot transfer in robotic manipulation.

Abstract

Learning transferable latent actions from large-scale object manipulation videos can significantly enhance generalization in downstream robotics tasks, as such representations are agnostic to different robot embodiments. Existing approaches primarily rely on visual reconstruction objectives while neglecting physical priors, leading to sub-optimal performance in learning universal representations. To address these challenges, we propose a Universal Latent Action Learning framework that takes task instructions and multiple frames as inputs, and optimizes both future frame reconstruction and action sequence prediction. Unlike prior works, incorporating action predictions (e.g., gripper or hand trajectories and orientations) allows the model to capture richer physical priors such as real-world distances and orientations, thereby enabling seamless transferability to downstream tasks. We…
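
The abstract describes optimizing future frame reconstruction and action sequence prediction together, but the visible text does not give the loss functions or their weighting. As a rough sketch only, assuming both terms are mean squared errors and that they are combined with a hypothetical weight `action_weight` (neither assumption comes from the paper), a joint objective of this kind could look like:

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal in length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    pred_frame: Sequence[float],
    target_frame: Sequence[float],
    pred_actions: Sequence[float],
    target_actions: Sequence[float],
    action_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Illustrative joint objective: frame reconstruction + action prediction.

    Both terms are plain MSE here as an assumption; the paper may use
    different losses, representations, and weights.
    """
    reconstruction_term = mse(pred_frame, target_frame)
    action_term = mse(pred_actions, target_actions)
    return reconstruction_term + action_weight * action_term


# Toy usage with flattened "frames" and a one-step "action" value.
loss = joint_loss([0.0, 1.0], [0.0, 0.0], [1.0], [3.0], action_weight=0.5)
```

In this toy setup the action term pulls the learned representation toward physically meaningful quantities (such as gripper displacement), while the reconstruction term keeps it tied to visual change; how LatBot actually balances the two is not stated in the text shown here.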

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis