Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, Jinwoo Shin

TL;DR
DUST introduces a dual-stream diffusion framework for vision-language-action models that effectively handles modality differences, improving robotic task performance in simulation and real-world scenarios.
Contribution
The paper proposes a novel multimodal diffusion transformer architecture with separate modality streams and decoupled training, enabling better joint modeling of vision and action data.
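To make the idea of separate modality streams with cross-modal sharing concrete, here is a minimal NumPy sketch. It assumes an MMDiT-style design in which each modality keeps its own projection weights while attention runs over the concatenated tokens of both streams. The class name `DualStreamBlock`, the single-head attention, and the weight layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class DualStreamBlock:
    """Hypothetical block: per-modality weights, one shared attention pass."""

    def __init__(self, dim, rng):
        def init():
            return rng.normal(0.0, dim ** -0.5, size=(dim, dim))

        # Each stream owns its own q/k/v/output projections (assumption).
        self.w = {
            m: {name: init() for name in ("q", "k", "v", "o")}
            for m in ("action", "vision")
        }

    def __call__(self, action_tokens, vision_tokens):
        streams = {"action": action_tokens, "vision": vision_tokens}
        # Project each stream with its own weights, then concatenate tokens.
        q = np.concatenate([streams[m] @ self.w[m]["q"] for m in streams])
        k = np.concatenate([streams[m] @ self.w[m]["k"] for m in streams])
        v = np.concatenate([streams[m] @ self.w[m]["v"] for m in streams])
        # Joint attention is the only place where the two streams interact.
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        mixed = attn @ v
        # Split back into streams and apply per-modality output projections.
        n_action = action_tokens.shape[0]
        action_out = action_tokens + mixed[:n_action] @ self.w["action"]["o"]
        vision_out = vision_tokens + mixed[n_action:] @ self.w["vision"]["o"]
        return action_out, vision_out
```

In this sketch the token counts of the two streams can differ, and each stream's output keeps its own shape, so the streams stay separate everywhere except inside the shared attention step.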
Findings
Achieves up to 6% improvement over baseline in simulated benchmarks.
Improves success rate by 13% on real-world robotic tasks.
Enhances transfer learning performance with large-scale pretraining.
Abstract
Recently, augmenting vision-language-action models (VLAs) with world-models has shown promise in robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, we propose training techniques such as independent noise perturbations for each modality and a decoupled flow matching loss, which enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified…
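The training idea described in the abstract (independent noise perturbations per modality plus a decoupled flow matching loss) can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes a linear noise-to-data path with velocity target `data - noise`, flat 2-D feature arrays for both modalities, and a hypothetical `model(noisy_a, noisy_o, t_a, t_o)` callable that returns one velocity prediction per modality. The `obs_weight` parameter is also an assumption.

```python
import numpy as np


def interpolate(data, noise, t):
    """Linear path from noise (t=0) to data (t=1)."""
    return (1.0 - t) * noise + t * data


def decoupled_fm_loss(model, actions, next_obs, rng, obs_weight=1.0):
    """Sketch of a per-modality flow matching loss (assumed formulation).

    actions:  array of shape (batch, action_dim)
    next_obs: array of shape (batch, obs_dim)
    """
    batch = actions.shape[0]
    # Independent timesteps: each modality gets its own noise level.
    t_a = rng.uniform(size=(batch, 1))
    t_o = rng.uniform(size=(batch, 1))
    eps_a = rng.standard_normal(actions.shape)
    eps_o = rng.standard_normal(next_obs.shape)
    noisy_a = interpolate(actions, eps_a, t_a)
    noisy_o = interpolate(next_obs, eps_o, t_o)
    # The model sees both noisy modalities and both timesteps.
    v_a, v_o = model(noisy_a, noisy_o, t_a, t_o)
    # Each modality is regressed onto its own velocity target.
    loss_a = np.mean((v_a - (actions - eps_a)) ** 2)
    loss_o = np.mean((v_o - (next_obs - eps_o)) ** 2)
    return loss_a + obs_weight * loss_o, loss_a, loss_o
```

Because `t_a` and `t_o` are sampled separately, a model trained this way sees combinations such as clean observations with noisy actions and the reverse, which is one plausible reading of learning the joint distribution "in a bidirectional manner".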
Peer Reviews
Decision: Submitted to ICLR 2026
1. I think this paper gives a good summary of world-model-based VLAs. 2. Frankly speaking, the paper includes a large number of experiments.
1. Overall, the experimental evaluation needs to be improved: there are only pick-and-place tasks. 2. What is the control frequency? 3. I do not think there is a significant difference between (b) and (c) in Figure 1.
1. The dual-stream diffusion structure elegantly balances modality decoupling with cross-modal communication. 2. The asynchronous denoising method at test time is useful. 3. The presentation is clear with nice figures, and the writing is easy to follow.
1. Real-world validation is only conducted on four pick-and-place tasks. Including a broader range of tasks could further verify the effectiveness of the proposed framework. (Given the limited rebuttal phase, the authors do not need to add additional real-world experiments.) 2. The paper lacks direct comparisons with the most relevant baselines, such as PAD, Video Policy, Video Prediction Policy, and UAV.
- Proposes a dual-stream multimodal diffusion transformer for action and image prediction; the ablations also demonstrate the effectiveness of processing the modalities separately. - DUST achieves superior performance over the backbone method in two simulated environments. - DUST can benefit from pretraining on internet-scale data, as ablation studies show.
- **Inadequate ablation analysis** The ablation experiments do not examine the use of the VLM module. Please add experiments that replace the LLM with a common language model (such as the settings of PAD or MDT). - **Lack of real-world demonstration** The real-world experiments involve relatively simple tasks and scenarios. Moreover, the absence of demonstration videos raises concerns about the model’s real-world performance. - **Insufficient baselines** Due to insufficient ba
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Reinforcement Learning in Robotics
