Latent Action Pretraining from Videos

Seonghyeon Ye; Joel Jang; Byeongguk Jeon; Sejune Joo; Jianwei Yang; Baolin Peng; Ajay Mandlekar; Reuben Tan; Yu-Wei Chao; Bill Yuchen Lin; Lars Liden; Kimin Lee; Jianfeng Gao; Luke Zettlemoyer; Dieter Fox; Minjoon Seo

arXiv:2410.11758 · cs.RO · May 16, 2025

TL;DR

This paper presents an unsupervised pretraining method for vision-language-action models that learns from internet videos without robot action labels, significantly improving manipulation task performance and generalization.

Contribution

Introduces Latent Action Pretraining (LAPA), a novel approach to pretrain VLA models using discrete latent actions learned from videos without ground-truth labels.

Findings

01. Outperforms existing video-based robot manipulation methods.

02. Surpasses state-of-the-art models trained with robotic labels.

03. Enables positive transfer from human videos to robotic tasks.

Abstract

We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels, typically collected by human teleoperators, during pretraining, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels. We first train an action quantization model leveraging a VQ-VAE-based objective to learn discrete latent actions between image frames, then pretrain a latent VLA model to predict these latent actions from observations and task descriptions, and finally finetune the VLA on small-scale robot manipulation data to map from latent to robot actions. Experimental results demonstrate that our method significantly…
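The first stage described in the abstract, assigning a discrete latent action to the change between two frames, can be illustrated with a toy sketch. This is not the authors' implementation: a real VQ-VAE learns its encoder and codebook from pixels, whereas the sketch below assumes frames are already reduced to small feature vectors and uses a fixed, hand-written codebook. All names (`frame_delta`, `quantize`, `label_video`, `CODEBOOK`) and all numbers are hypothetical.

```python
# Toy illustration only: a nearest-neighbour codebook lookup standing in for
# a learned VQ-VAE quantizer. Every name and value here is hypothetical.

def frame_delta(feat_t, feat_next):
    """Element-wise change between two frame feature vectors."""
    return [b - a for a, b in zip(feat_t, feat_next)]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def quantize(delta, codebook):
    """Return the index of the codebook entry closest to `delta`."""
    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(delta, codebook[i]),
    )


# Hand-written 2-D codebook; in a VQ-VAE these entries would be learned.
CODEBOOK = [
    [0.0, 0.0],   # 0: no change
    [1.0, 0.0],   # 1: positive change along the first axis
    [0.0, 1.0],   # 2: positive change along the second axis
    [-1.0, 0.0],  # 3: negative change along the first axis
]


def label_video(frame_feats, codebook=CODEBOOK):
    """Pseudo-label each consecutive frame pair with a discrete latent action."""
    return [
        quantize(frame_delta(a, b), codebook)
        for a, b in zip(frame_feats, frame_feats[1:])
    ]


if __name__ == "__main__":
    video = [[0.0, 0.0], [0.9, 0.1], [1.0, 1.1], [1.0, 1.0]]
    print(label_video(video))
```

In this simplified picture, the integer indices returned by `label_video` play the role of the latent action labels that the second stage trains the VLA to predict from an observation and a task description. The third stage, finetuning on robot data to map latent actions to robot actions, is not shown.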

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

Originality:
- The proposed method removes the need for action labels when pretraining VLA models, which significantly increases data availability.
- Training VQ to predict the delta between frames is a simple and scalable way of learning coarse latent actions.
- A significant performance improvement over the SoTA model (OpenVLA) under various scenarios, and a relatively small performance gap between the upper-bound case (ActionVLA) and LAPA.

Quality:
- The proposed method is technically sound.
-

Weaknesses

- Lack of Experiments on Sequence Length in VQ Stage: There are no experiments illustrating the effect of different sequence lengths during the VQ stage. It seems arbitrary that the latent code length is set to 4 (lines 433-434), while for the Language Table dataset (line 933) the sequence length is set to 1. A discussion of the rationale behind these choices is missing. Incorporating experiments on various sequence lengths could help assess LAPA’s flexibility and robustness.
- Limited Abil

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The innovative approach of using VQ-VAE to encode image dynamics into latent space and replacing labeled actions with these encoded tokens is particularly intriguing. This method holds significant importance for the research community, given the high costs associated with data collection for action labeling. The experimental validation is comprehensive, with strong results obtained from both simulation environments and real-world settings, underscoring the reliability of the model. The analys

Weaknesses

The pretraining and finetuning setups in the experiment section are a little confusing. For example, how is ActionVLA pretrained with action labels when there are no action labels in Something-Something V2? The finetuning recipe used for the other baselines is not described in detail, which makes me concerned about the fairness of the comparison. I hope the authors could add detailed information in the appendix. All the experiments in simulators are trained with only few trajectories, especially in Brid

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- Interesting approach: The proposed approach is both simple and practical, potentially easier to implement than the baselines considered in the experimental section. Pretraining VLAs on actionless data, especially human videos, is particularly relevant, and the use of inferred latent actions is a sensible solution.
- Good experimental results: Through extensive comparative and ablation studies in both simulation and real-world robot settings, the authors clearly demonstrate the effectiveness of

Weaknesses

Since the latent actions are not directly used for downstream control and the model is finetuned on robot action labels, it’s unclear whether the performance gains come from leveraging temporal information or action priors in videos, or simply from pretraining on data (robot trajectories/SSv2) that aligns more closely with the finetuning robot data than the base VLM’s original training data does. Would a pretraining task without temporal information, such as image captioning, achieve similar result

Code & Models

Repositories

LatentActionPretraining/LAPA
jax

Models

latent-action-pretraining/LAPA-7B-openx
model · ♡ 15

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Video Analysis and Summarization