World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

Runze Li; Hongyin Zhang; Junxi Jin; Qixin Zeng; Zifeng Zhuang; Yiqi Tang; Shangke Lyu; Donglin Wang

arXiv:2604.14732·cs.RO·April 21, 2026

TL;DR

The paper introduces the World-Value-Action (WAV) model, a unified framework for implicit planning in vision-language-action (VLA) systems. It improves long-horizon decision making by generating actions through latent-space inference, and it is reported to outperform existing methods.

Contribution

The paper proposes a structured latent representation of future trajectories that enables implicit planning and long-horizon reasoning in VLA systems, supported by theoretical analysis and empirical evaluation.

Findings

01 The WAV model outperforms state-of-the-art methods in success rate and robustness.

02 Latent-space inference improves the efficiency of long-horizon planning.

03 Theoretical analysis shows advantages of latent-space planning over action-space planning.

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction and lack the ability to reason over long-horizon trajectories and evaluate their consequences, which limits performance in complex decision-making tasks. In this work, we introduce the World-Value-Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, the WAV model learns a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long-horizon utility. Action generation is then formulated as inference in this latent space, where the model…
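
To make the division of labor in the abstract concrete, here is a minimal toy sketch, not the authors' implementation. The abstract is truncated before it says how inference in the latent space is carried out, so the sketch substitutes a plain best-of-N selection over sampled latents. Every name in it (`decode_actions`, `world_model`, `trajectory_value`, `plan_in_latent_space`), the two-dimensional latent, and the 1-D dynamics are assumptions made for illustration; in the paper these components are learned and conditioned on visual observations and language instructions.

```python
import random


def decode_actions(z, horizon):
    """Hypothetical decoder: maps a 2-D latent z to a sequence of 1-D actions."""
    # Toy linear decoder: z[0] sets a base action, z[1] adds a ramp over time.
    return [z[0] + z[1] * t / max(horizon - 1, 1) for t in range(horizon)]


def world_model(state, action):
    """Hypothetical stand-in for a learned world model (toy additive dynamics)."""
    return state + action


def trajectory_value(states, goal):
    """Hypothetical stand-in for a learned trajectory value function.

    Higher is better: the negative distance between the final state and the goal.
    """
    return -abs(states[-1] - goal)


def plan_in_latent_space(state, goal, horizon=5, num_samples=256, seed=0):
    """Best-of-N selection over sampled latents (an assumed inference scheme)."""
    rng = random.Random(seed)
    best_value, best_actions = float("-inf"), None
    for _ in range(num_samples):
        # Sample one latent candidate describing a whole future trajectory.
        z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        actions = decode_actions(z, horizon)
        # Roll the world model forward to predict the resulting states.
        states = [state]
        for action in actions:
            states.append(world_model(states[-1], action))
        # Score the predicted trajectory and keep the best candidate so far.
        value = trajectory_value(states, goal)
        if value > best_value:
            best_value, best_actions = value, actions
    return best_actions, best_value


if __name__ == "__main__":
    actions, value = plan_in_latent_space(state=0.0, goal=3.0)
    print("selected actions:", actions)
    print("trajectory value:", value)
```

The point of the sketch is only the structure: candidates are drawn and compared in a low-dimensional latent space that describes whole trajectories, while the world model and value function judge each candidate by its predicted long-horizon outcome rather than by its next action alone.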

Code & Models

Repositories

Win-commit/WAV
github
