Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

John Won; Kyungmin Lee; Huiwon Jang; Dongyoung Kim; Jinwoo Shin

arXiv:2510.27607·cs.CV·November 5, 2025

Open Access · 3 Reviews

TL;DR

DUST introduces a dual-stream diffusion framework for vision-language-action models that effectively handles modality differences, improving robotic task performance in simulation and real-world scenarios.

Contribution

The paper proposes a novel multimodal diffusion transformer architecture with separate modality streams and decoupled training, enabling better joint modeling of vision and action data.
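As a rough illustration of "separate modality streams" with cross-modal sharing, the sketch below gives each modality its own projection weights and lets the two streams interact only through one joint attention over the concatenated tokens. This is an assumption about the general pattern, not the paper's actual layer: the names `init_stream` and `dual_stream_attention`, the single attention head, and the residual update are all hypothetical choices made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_stream(rng, dim):
    # Hypothetical helper: one set of q/k/v/out weights per modality stream.
    return {name: rng.standard_normal((dim, dim)) / np.sqrt(dim)
            for name in ("q", "k", "v", "out")}

def dual_stream_attention(action_tokens, vision_tokens, action_params, vision_params):
    """Single-head joint attention with modality-specific weights (illustrative only)."""
    n_a, dim = action_tokens.shape
    # Each stream projects its own tokens with its own weights.
    q = np.concatenate([action_tokens @ action_params["q"], vision_tokens @ vision_params["q"]])
    k = np.concatenate([action_tokens @ action_params["k"], vision_tokens @ vision_params["k"]])
    v = np.concatenate([action_tokens @ action_params["v"], vision_tokens @ vision_params["v"]])
    # Cross-modal sharing: every token attends over both modalities.
    mixed = softmax(q @ k.T / np.sqrt(dim)) @ v
    # Split back into streams; each applies its own output projection and residual.
    action_out = action_tokens + mixed[:n_a] @ action_params["out"]
    vision_out = vision_tokens + mixed[n_a:] @ vision_params["out"]
    return action_out, vision_out

rng = np.random.default_rng(0)
action_params, vision_params = init_stream(rng, 8), init_stream(rng, 8)
actions = rng.standard_normal((3, 8))  # 3 action tokens
vision = rng.standard_normal((5, 8))   # 5 vision tokens
a_out, v_out = dual_stream_attention(actions, vision, action_params, vision_params)
print(a_out.shape, v_out.shape)
```

In this sketch the only place where information crosses modalities is the shared attention matrix; everything else stays stream-specific.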

Findings

1. Achieves up to a 6% improvement over the baseline on simulated benchmarks.
2. Improves the success rate by 13% on real-world robotic tasks.
3. Enhances transfer learning performance with large-scale pretraining.

Abstract

Recently, augmenting vision-language-action models (VLAs) with world-models has shown promise in robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, we propose training techniques such as independent noise perturbations for each modality and a decoupled flow matching loss, which enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified…
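The two training techniques named in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a linear (rectified-flow style) interpolation path, uniform timestep sampling, flattened 2D inputs, and a hypothetical `model(x_a, x_o, t_a, t_o)` that returns one velocity prediction per modality. The function name and the `obs_weight` parameter are likewise illustrative.

```python
import numpy as np

def decoupled_flow_matching_loss(model, action, obs, rng, obs_weight=1.0):
    """Per-modality flow matching with independently sampled noise levels (sketch)."""
    batch = action.shape[0]
    # Independent perturbations: each modality draws its own timestep and noise.
    t_a = rng.uniform(size=(batch, 1))
    t_o = rng.uniform(size=(batch, 1))
    eps_a = rng.standard_normal(action.shape)
    eps_o = rng.standard_normal(obs.shape)
    # Assumed linear path: t = 0 is pure noise, t = 1 is clean data.
    x_a = (1.0 - t_a) * eps_a + t_a * action
    x_o = (1.0 - t_o) * eps_o + t_o * obs
    # Under that path the target velocity is (data - noise) for each modality.
    v_a_pred, v_o_pred = model(x_a, x_o, t_a, t_o)
    loss_a = np.mean((v_a_pred - (action - eps_a)) ** 2)
    loss_o = np.mean((v_o_pred - (obs - eps_o)) ** 2)
    # Decoupled: each modality has its own loss term, combined by a weighted sum.
    return loss_a + obs_weight * loss_o, loss_a, loss_o

# Placeholder model that predicts zero velocity, used only to exercise the loss.
def zero_model(x_a, x_o, t_a, t_o):
    return np.zeros_like(x_a), np.zeros_like(x_o)

rng = np.random.default_rng(0)
actions = rng.standard_normal((4, 7))        # 4 samples, 7-dim flattened action chunk
observations = rng.standard_normal((4, 16))  # 4 samples, 16-dim next-state embedding
total, loss_a, loss_o = decoupled_flow_matching_loss(zero_model, actions, observations, rng)
print(total, loss_a, loss_o)
```

Because `t_a` and `t_o` are drawn separately, a batch contains cases where one modality is nearly clean and the other nearly pure noise, which is one plausible reading of learning the joint distribution "in a bidirectional manner".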

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. I think this paper gives a good summary of world-model-based VLAs.
2. Frankly speaking, the number of experiments is quite large.

Weaknesses

1. Overall, the experimental evaluation needs to be improved. There are only pick-and-place tasks.
2. What about the control frequency?
3. I do not think there is a significant difference between (b) and (c) in Figure 1.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The dual-stream diffusion structure elegantly balances modality decoupling with cross-modal communication.
2. The asynchronous denoising method at test time is useful.
3. The presentation is clear, with nice figures, and the writing is easy to follow.

Weaknesses

1. Real-world validation is only conducted on four pick-and-place tasks. Including a broader range of tasks could further verify the effectiveness of the proposed framework. (Given the limited rebuttal phase, the authors do not need to add additional real-world experiments.)
2. The paper lacks direct comparisons with the most relevant baselines, such as PAD, Video Policy, Video Prediction Policy, and UAV.

Reviewer 03 · Rating 4 · Confidence 5

Strengths

- Proposes a dual-stream multimodal diffusion transformer for action and image prediction; the ablations also demonstrate the effectiveness of processing the different modalities separately.
- DUST achieves superior performance over the backbone method in two simulated environments.
- DUST can benefit from pretraining on internet-scale data, as the ablation studies show.

Weaknesses

- **Inadequate ablation analysis** The ablation experiments do not examine the use of the VLM module. Please supplement the relevant experiments by replacing the LLM with a common language model (such as in the settings of PAD or MDT).
- **Lack of real-world demonstration** The real-world experiments involve relatively simple tasks and scenarios. Moreover, the absence of demonstration videos raises concerns about the model's real-world performance.
- **Insufficient baselines** Due to insufficient ba

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Reinforcement Learning in Robotics