DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models

Emily Yue-Ting Jia; Weiduo Yuan; Tianheng Shi; Vitor Guizilini; Jiageng Mao; Yue Wang

arXiv:2603.16860·cs.RO·March 18, 2026

Open Access

TL;DR

DreamPlan introduces a reinforcement fine-tuning framework for vision-language models using video world models, enabling efficient physical grounding and improved manipulation success without extensive real-world data.

Contribution

DreamPlan fine-tunes VLM planners through virtual rollouts in a learned video world model, reducing reliance on costly real-world interaction.

Findings

01 Significantly improves manipulation success rates.

02 Efficiently injects physical knowledge into VLMs.

03 Reduces need for large-scale real-world data.

Abstract

Robotic manipulation requires sophisticated commonsense reasoning, a capability naturally possessed by large-scale Vision-Language Models (VLMs). While VLMs show promise as zero-shot planners, their lack of grounded physical understanding often leads to compounding errors and low success rates when deployed in complex real-world environments, particularly for challenging tasks like deformable object manipulation. Although Reinforcement Learning (RL) can adapt these planners to specific task dynamics, directly fine-tuning VLMs via real-world interaction is prohibitively expensive, unsafe, and sample-inefficient. To overcome this bottleneck, we introduce DreamPlan, a novel framework for the reinforcement fine-tuning of VLM planners via video world models. Instead of relying on costly physical rollouts, DreamPlan first leverages the zero-shot VLM to collect exploratory interaction data. We…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics