villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models
Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, Jianyu Chen, Jiang Bian

TL;DR
villa-X introduces a novel framework that enhances latent action modeling in vision-language-action models, enabling zero-shot generalization and superior manipulation performance in diverse robotic tasks.
Contribution
The paper presents villa-X, a new ViLLA framework that improves latent action learning and integration, advancing generalization in robot manipulation policies.
Findings
Zero-shot latent action planning for unseen embodiments
Superior performance on diverse simulation tasks
Effective real-world robotic manipulation
Abstract
Vision-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent works have begun to explore the incorporation of latent actions, abstract representations of motion between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Vision-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. We demonstrate that villa-X can generate latent action plans in a zero-shot fashion, even for unseen embodiments and open-vocabulary symbolic understanding. This capability enables villa-X to achieve superior performance across diverse simulation tasks in SIMPLER and…
Peer Reviews
Decision: ICLR 2026 Poster
- The proposed approach has strong empirical results. In particular, in Table 2, the proposed approach outperforms state-of-the-art VLA models, such as OpenVLA, OpenVLA-OFT, $\pi_0$, and Gr00T. It also outperforms other algorithms such as MoTo and LAPA.
- Using an embodiment-specific embedding is an interesting approach to leveraging diverse robot datasets, where different datasets have slightly different dynamics and action spaces. Ablation study shows that the embodiment embedding has a posit
- Although the main design components are validated through ablation studies, some finer-grained design choices lack discussion and study. See more details in the **Questions** section.
- Figure 3, which shows the probing experiment results, is not very easy to understand. Perhaps a plot showing the distribution of error in different intervals, for both w/ pp and w/o pp, would be more informative. It is also not very convincing why $L_\infty$ (max across all dimensions) is preferred over $L_1$ (summing/averag
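To make the $L_\infty$ versus $L_1$ distinction raised in this review concrete, here is a minimal sketch. It is not the paper's probing code: the function names, the 4-dimensional vectors, and the error values are all hypothetical, chosen only to show that the two metrics can rank the same pair of predictions in opposite order.

```python
def l1_error(pred, target):
    """Mean absolute error across dimensions (the averaged L1 variant)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def linf_error(pred, target):
    """Largest absolute error across dimensions (L-infinity)."""
    return max(abs(p - t) for p, t in zip(pred, target))


# Hypothetical 4-dimensional predictions against a single target (assumed values).
target = [0.0, 0.0, 0.0, 0.0]
spread = [0.5, 0.5, 0.5, 0.5]  # moderate error in every dimension
spike = [0.0, 0.0, 0.0, 1.0]   # large error in one dimension only

for name, pred in [("spread", spread), ("spike", spike)]:
    print(name, l1_error(pred, target), linf_error(pred, target))
```

In this toy case the averaged $L_1$ error prefers `spike` (0.25 against 0.5), while $L_\infty$ prefers `spread` (0.5 against 1.0). The choice of metric therefore decides whether a single badly predicted dimension dominates the reported error.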
1. The paper is well-structured and clearly written. The problem motivation is sound, and the technical approach is explained logically.
2. The evaluation is thorough, encompassing systematic ablations, major simulation benchmarks (SIMPLER, LIBERO), and real-world deployment on two distinct platforms.
3. The demonstrated capability for zero-shot generalization to novel embodiments addresses a core challenge in the field.
1. The technical contributions, while valuable, exhibit limited novelty relative to existing literature. The proposed proprioceptive Forward Dynamics Model (proprio-FDM), which grounds latent actions by predicting low-level states, is conceptually similar to the approach of Nikulin et al. [1], who employ a linear decoder on latent tokens to predict actions. The efficacy of this general principle for grounding has also been previously analyzed by Zhang et al. [2]. Furthermore, the architectural d
1. The system design is simple, effective, and scalable.
2. The experiments are sufficient and comprehensive.
3. The writing is clear and provides a good reading experience.
See questions.
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Explainable Artificial Intelligence (XAI)
