DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

Yi Chen; Yuying Ge; Hui Zhou; Mingyu Ding; Yixiao Ge; Xihui Liu

arXiv:2603.29844 · cs.RO · April 29, 2026

TL;DR

DIAL places a latent intent bottleneck between a VLM-based planner and a lightweight action policy, so that the VLM contributes to high-level decision making rather than serving only as an encoder. It reports state-of-the-art robotic manipulation on RoboCasa GR1 with 10x fewer demonstrations.

Contribution

The paper proposes a novel latent world modeling approach with a two-stage training paradigm, improving stability and generalization in end-to-end VLA systems.

Findings

01. Achieves state-of-the-art performance on RoboCasa GR1 with 10x fewer demonstrations.

02. Learns physically grounded manipulation priors from heterogeneous human demonstrations.

03. Demonstrates robust zero-shot generalization to unseen objects and configurations.

Abstract

The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM's potential in high-level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, a framework bridging high-level decision making and low-level motor execution through a differentiable latent intent bottleneck. Specifically, a VLM-based System-2 performs latent world modeling by synthesizing latent visual foresight within the VLM's native feature space; this foresight explicitly encodes intent and serves as the structural bottleneck. A lightweight System-1 policy then…
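The data flow described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the class names `System2` and `System1`, the random linear maps, and the dimensions are all hypothetical stand-ins. Because the abstract is truncated before it says what System-1 consumes, the sketch assumes that the policy reads the current features together with the foresight latent. It shows only the forward pass; training and gradient flow through the bottleneck are not modeled.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a learned module (hypothetical)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, vec):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


class System2:
    """Stand-in for the VLM-based planner.

    Produces a latent 'foresight' vector with the same dimensionality as the
    input features, mirroring the idea of predicting in the VLM's own
    feature space.
    """

    def __init__(self, feat_dim, rng):
        self.weights = make_linear(feat_dim, feat_dim, rng)

    def foresight(self, features):
        return apply_linear(self.weights, features)


class System1:
    """Stand-in for the lightweight policy.

    ASSUMPTION: it conditions on current features plus the foresight latent.
    """

    def __init__(self, feat_dim, action_dim, rng):
        self.weights = make_linear(2 * feat_dim, action_dim, rng)

    def act(self, features, foresight):
        # List concatenation joins the two inputs into one vector.
        return apply_linear(self.weights, features + foresight)


def run_step(features, system2, system1):
    """One forward pass: features -> latent intent -> action."""
    latent = system2.foresight(features)
    action = system1.act(features, latent)
    return latent, action


rng = random.Random(0)
FEAT_DIM, ACTION_DIM = 8, 4  # illustrative sizes only
system2 = System2(FEAT_DIM, rng)
system1 = System1(FEAT_DIM, ACTION_DIM, rng)
features = [rng.uniform(-1.0, 1.0) for _ in range(FEAT_DIM)]
latent, action = run_step(features, system2, system1)
print(len(latent), len(action))
```

In this sketch the only path from `System2` to the action is the `latent` vector, which is the sense in which the foresight acts as a structural bottleneck.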
