Latent Action Pretraining from Videos
Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo

TL;DR
This paper presents an unsupervised pretraining method for vision-language-action models that learns from internet videos without robot action labels, significantly improving manipulation task performance and generalization.
Contribution
Introduces Latent Action Pretraining (LAPA), a novel approach to pretrain VLA models using discrete latent actions learned from videos without ground-truth labels.
Findings
Outperforms existing video-based robot manipulation methods.
Surpasses state-of-the-art models trained with robot action labels.
Enables positive transfer from human videos to robotic tasks.
Abstract
We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels, typically collected by human teleoperators, during pretraining, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels. We first train an action quantization model using a VQ-VAE-based objective to learn discrete latent actions between image frames, then pretrain a latent VLA model to predict these latent actions from observations and task descriptions, and finally finetune the VLA on small-scale robot manipulation data to map from latent to robot actions. Experimental results demonstrate that our method significantly…
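The first stage described in the abstract, turning the change between two frames into a discrete latent action, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `toy_encoder` is a hypothetical linear stand-in for the learned encoder, the codebook values are made up, and the decoder and the VQ-VAE training loss are omitted. It shows only the nearest-codebook lookup that makes the latent action discrete, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z: Vector, codebook: Sequence[Vector]) -> Tuple[int, Vector]:
    """Return the index and value of the codebook entry closest to z."""
    index = min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
    return index, codebook[index]


def toy_encoder(frame_t: Vector, frame_next: Vector, weights: Sequence[Vector]) -> List[float]:
    """Hypothetical encoder: a linear map applied to the frame difference.

    A real model would use a learned network; this is only a placeholder.
    """
    delta = [b - a for a, b in zip(frame_t, frame_next)]
    return [sum(w * d for w, d in zip(row, delta)) for row in weights]


def latent_action(frame_t: Vector, frame_next: Vector,
                  weights: Sequence[Vector], codebook: Sequence[Vector]) -> int:
    """Map a pair of consecutive (flattened) frames to a discrete latent action id."""
    z = toy_encoder(frame_t, frame_next, weights)
    index, _ = quantize(z, codebook)
    return index


if __name__ == "__main__":
    # Made-up codebook with three 2-D entries and a made-up 2x4 encoder matrix.
    codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
    weights = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
    still = [0.0, 0.0, 0.0, 0.0]
    moved = [1.0, 1.0, 1.0, 1.0]
    print(latent_action(still, still, weights, codebook))
    print(latent_action(still, moved, weights, codebook))
```

In the pipeline the abstract outlines, ids like these would serve as prediction targets when pretraining the latent VLA, and a later finetuning step on a small amount of labeled robot data would map them to real robot actions.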
Peer Reviews
Decision: ICLR 2025 Poster
Originality: - The proposed method removes the need for action labels when pretraining VLA models, which significantly increases data availability. - Training a VQ model to predict the delta between frames is a simple and scalable way of learning coarse latent actions. - A significant performance improvement over the SoTA model (OpenVLA) in various scenarios, and a relatively small performance gap between the upper-bound case (ActionVLA) and LAPA. Quality: - The proposed method is technically sound. -
- Lack of Experiments on Sequence Length in VQ Stage: There are no experiments illustrating the effect of different sequence lengths during the VQ stage. It seems arbitrary that the latent code length is set to 4 (lines 433-434), while for the Language Table dataset (line 933) the sequence length is set to 1. A discussion of the rationale behind these choices is missing. Incorporating experiments on various sequence lengths could help assess LAPA’s flexibility and robustness. - Limited Abil
The innovative approach of using VQ-VAE to encode image dynamics into latent space and replacing labeled actions with these encoded tokens is particularly intriguing. This method holds significant importance for the research community, given the high costs associated with data collection for action labeling. The experimental validation is comprehensive, with strong results obtained from both simulation environments and real-world settings, underscoring the reliability of the model. The analys
The pretraining and finetuning setups in the experiment section are a little confusing. For example, how is ActionVLA pretrained with action labels when no action labels exist in Something-Something V2? The finetuning recipe used for the other baselines is not described in detail, which makes me concerned about the fairness of the comparison. I hope the authors can add detailed information in the appendix. All the experiments in simulators are trained with only a few trajectories, especially in Brid
- Interesting approach: The proposed approach is both simple and practical, potentially easier to implement than the baselines considered in the experimental section. Pretraining VLAs on actionless data, especially human videos, is particularly relevant, and the use of inferred latent actions is a sensible solution. - Good experimental results: Through extensive comparative and ablation studies in both simulation and real-world robot settings, the authors clearly demonstrate the effectiveness of
Since the latent actions are not directly used for downstream control and the model is finetuned on robot action labels, it’s unclear whether the performance gains come from leveraging temporal information/action priors in videos or simply from pretraining on data (robot trajectories/SSv2) that more closely aligns with the finetuning robot data compared to the base VLM’s original training data. Would a pretraining task without temporal information—such as image captioning— achieve similar result
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Video Analysis and Summarization
