DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Wenyao Zhang; Hongsi Liu; Zekun Qi; Yunnan Wang; Xinqiang Yu; Jiazhao Zhang; Runpei Dong; Jiawei He; Fan Lu; He Wang; Zhizheng Zhang; Li Yi; Wenjun Zeng; Xin Jin

arXiv:2507.04447 · cs.CV · August 27, 2025

TL;DR

DreamVLA is a vision-language-action framework that forecasts comprehensive world knowledge (dynamic, spatial, and semantic cues) and uses those forecasts to improve robot manipulation through a perception-prediction-action loop.

Contribution

DreamVLA introduces dynamic-region-guided world knowledge prediction, a block-wise attention mechanism, and a diffusion-based transformer, with the aim of improving robot manipulation through better modeling of world knowledge.
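
To make the idea of a block-wise attention mechanism concrete, here is a minimal NumPy sketch of how a block-structured attention mask can be built and applied. This page does not describe the actual masking rules, so everything below is an assumption for illustration: the group names (`context`, `dynamic`, `spatial`, `semantic`, `action`), the token counts, and the visibility rules (each knowledge group sees the context and itself but not the other knowledge groups; the action group sees everything) are hypothetical and should not be read as DreamVLA's implementation.

```python
import numpy as np

# Hypothetical token layout: group name -> number of tokens (order = sequence order).
GROUP_SIZES = {"context": 4, "dynamic": 2, "spatial": 2, "semantic": 2, "action": 1}

# Hypothetical visibility rules: query group -> key groups it may attend to.
ALLOWED = {
    "context": {"context"},
    "dynamic": {"context", "dynamic"},
    "spatial": {"context", "spatial"},
    "semantic": {"context", "semantic"},
    "action": {"context", "dynamic", "spatial", "semantic", "action"},
}


def blockwise_mask(group_sizes, allowed):
    """Return (mask, slices); mask[i, j] is True if query i may attend to key j."""
    names = list(group_sizes)
    offsets = np.cumsum([0] + [group_sizes[n] for n in names])
    slices = {
        n: slice(int(offsets[i]), int(offsets[i + 1])) for i, n in enumerate(names)
    }
    total = int(offsets[-1])
    mask = np.zeros((total, total), dtype=bool)
    for query_group, key_groups in allowed.items():
        for key_group in key_groups:
            mask[slices[query_group], slices[key_group]] = True
    return mask, slices


def masked_softmax(scores, mask):
    """Row-wise softmax that assigns zero weight to masked-out positions.

    Assumes every row of `mask` has at least one True entry.
    """
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)
```

With these assumed rules, `blockwise_mask(GROUP_SIZES, ALLOWED)` yields an 11 × 11 boolean mask in which the `dynamic`, `spatial`, and `semantic` blocks cannot see one another, and `masked_softmax` turns raw attention scores into weights that respect that mask.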

Findings

01. Achieves a 76.7% success rate on real robot tasks.
02. Attains an average length of 4.44 on the CALVIN ABC-D benchmark.
03. Demonstrates improved generalization and reasoning in manipulation tasks.

Abstract

Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns…

Code & Models

Repositories

Zhangwenyao1/DreamVLA (official)

Models

WenyaoZhang/DreamVLA (Hugging Face model)

Taxonomy

Topics: Semantic Web and Ontologies