TL;DR
The paper introduces the World-Value-Action (WAV) model, a unified framework for implicit planning in Vision-Language-Action (VLA) systems. It improves long-horizon decision making through latent-space inference and is reported to outperform existing methods.
Contribution
It proposes a structured latent representation for future trajectories, enabling implicit planning and long-horizon reasoning in VLA systems, with theoretical and empirical validation.
Findings
The WAV model outperforms state-of-the-art methods in success rate and robustness.
Latent-space inference improves long-horizon planning efficiency.
Theoretical analysis shows advantages over action-space planning.
Abstract
Vision-Language-Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction and cannot reason over long-horizon trajectories or evaluate their consequences, which limits performance in complex decision-making tasks. In this work, we introduce the World-Value-Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, the WAV model learns a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long-horizon utility. Action generation is then formulated as inference in this latent space, where the model…
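The interplay of world model, trajectory value function, and latent-space inference described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The abstract is truncated before it says how inference is performed, so the sketch substitutes plain best-of-N sampling over latents. Every name here (`world_model`, `trajectory_value`, `decode_action`, `plan`), the linear dynamics, and the `goal` vector standing in for the observation and instruction conditioning are hypothetical and chosen only for illustration.

```python
import random


def world_model(state, latent, horizon=5):
    """Toy stand-in for a learned world model (assumption, not from the paper).

    Rolls the state forward for `horizon` steps under a latent plan.
    """
    states = []
    for _ in range(horizon):
        # Hypothetical dynamics: each step shifts the state by the latent vector.
        state = [s + z for s, z in zip(state, latent)]
        states.append(state)
    return states


def trajectory_value(states, goal):
    """Toy stand-in for a trajectory value function (assumption).

    Scores a predicted trajectory by the negative squared distance
    between its final state and the goal.
    """
    final = states[-1]
    return -sum((f - g) ** 2 for f, g in zip(final, goal))


def decode_action(latent):
    """Toy decoder (assumption): the emitted action is the latent itself."""
    return list(latent)


def plan(state, goal, num_samples=256, seed=0):
    """Best-of-N selection over latent plans.

    Samples candidate latents, predicts their futures with the world model,
    scores each future with the value function, and decodes the best latent.
    """
    rng = random.Random(seed)
    best_latent, best_value = None, float("-inf")
    for _ in range(num_samples):
        latent = [rng.uniform(-1.0, 1.0) for _ in state]
        value = trajectory_value(world_model(state, latent), goal)
        if value > best_value:
            best_latent, best_value = latent, value
    return decode_action(best_latent), best_value


if __name__ == "__main__":
    action, value = plan(state=[0.0, 0.0], goal=[2.0, -1.0])
    print("chosen action:", action)
    print("predicted trajectory value:", value)
```

The point of the sketch is the division of labor: candidates are compared by the value of their predicted futures rather than by optimizing an action sequence step by step. A learned system would replace each toy function with a trained network and would likely use a more efficient inference procedure than uniform sampling.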
