villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

Xiaoyu Chen; Hangxing Wei; Pushi Zhang; Chuheng Zhang; Kaixin Wang; Yanjiang Guo; Rushuai Yang; Yucen Wang; Xinquan Xiao; Li Zhao; Jianyu Chen; Jiang Bian

arXiv:2507.23682·cs.RO·September 26, 2025

TL;DR

villa-X introduces a novel framework that enhances latent action modeling in vision-language-action models, enabling zero-shot generalization and superior manipulation performance in diverse robotic tasks.

Contribution

The paper presents villa-X, a new ViLLA framework that improves latent action learning and integration, advancing generalization in robot manipulation policies.

Findings

01 · Zero-shot latent action planning for unseen embodiments

02 · Superior performance on diverse simulation tasks

03 · Effective real-world robotic manipulation

Abstract

Vision-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent works have begun to explore the incorporation of latent actions, abstract representations of motion between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Vision-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. We demonstrate that villa-X can generate latent action plans in a zero-shot fashion, even for unseen embodiments and open-vocabulary symbolic understanding. This capability enables villa-X to achieve superior performance across diverse simulation tasks in SIMPLER and…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The proposed approach has strong empirical results. In particular, in Table 2, the proposed approach outperforms state-of-the-art VLA models, such as OpenVLA, OpenVLA-OFT, $\pi_0$, and Gr00T. It also outperforms other algorithms such as MoTo and LAPA.
- Using an embodiment-specific embedding is an interesting approach to leveraging diverse robot datasets, where different datasets have slightly different dynamics and action spaces. Ablation study shows that the embodiment embedding has a posit

Weaknesses

- Although the main design components are validated through ablation studies, some finer-grained design choices lack discussion and study. See more details in the **Questions** section.
- Figure 3, which shows the probing experiment results, is not very easy to understand. Perhaps a plot showing the distribution of error in different intervals, for both w/ pp and w/o pp, would be more informative. It is also not very convincing why $L_\infty$ (max across all dimensions) is preferred over $L_1$ (summing/averag
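To make the metric question in this review concrete, the following is a minimal sketch of how an averaged $L_1$ error and an $L_\infty$ error can disagree on the same predictions. The function names and the numbers are hypothetical and chosen only for illustration; they are not taken from the paper or its probing experiment.

```python
# Illustrative only: the error vectors below are made up, not results from the paper.

def l1_mean_error(pred, target):
    """Mean absolute error across dimensions (an averaged L1 metric)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def linf_error(pred, target):
    """Largest absolute error across dimensions (the L-infinity metric)."""
    return max(abs(p - t) for p, t in zip(pred, target))


target = [0.0, 0.0, 0.0, 0.0]
spread = [0.1, 0.1, 0.1, 0.1]  # small error in every dimension
spiky = [0.0, 0.0, 0.0, 0.4]   # one large error, all other dimensions exact

for name, pred in [("spread", spread), ("spiky", spiky)]:
    print(name, l1_mean_error(pred, target), linf_error(pred, target))
```

In this toy case both predictions have the same averaged $L_1$ error, while $L_\infty$ is larger for the prediction with a single badly wrong dimension. Which behavior is preferable for the probing experiment is exactly the point the reviewer asks the authors to justify.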

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper is well-structured and clearly written. The problem motivation is sound, and the technical approach is explained logically.
2. The evaluation is thorough, encompassing systematic ablations, major simulation benchmarks (SIMPLER, LIBERO), and real-world deployment on two distinct platforms.
3. The demonstrated capability for zero-shot generalization to novel embodiments addresses a core challenge in the field.

Weaknesses

1. The technical contributions, while valuable, exhibit limited novelty relative to existing literature. The proposed proprioceptive Forward Dynamics Model (proprio-FDM), which grounds latent actions by predicting low-level states, is conceptually similar to the approach of Nikulin et al. [1], who employ a linear decoder on latent tokens to predict actions. The efficacy of this general principle for grounding has also been previously analyzed by Zhang et al. [2]. Furthermore, the architectural d

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. The system design is simple, effective, and scalable.
2. The experiments are sufficient and comprehensive.
3. The writing is clear and provides a good reading experience.

Weaknesses

See questions.

Code & Models

Models

🤗 microsoft/villa-x · model · ♡ 8

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Explainable Artificial Intelligence (XAI)