DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models
Emily Yue-Ting Jia, Weiduo Yuan, Tianheng Shi, Vitor Guizilini, Jiageng Mao, Yue Wang

TL;DR
DreamPlan introduces a reinforcement fine-tuning framework for vision-language models using video world models, enabling efficient physical grounding and improved manipulation success without extensive real-world data.
Contribution
It proposes a novel method to fine-tune VLMs via virtual rollouts in a learned video world model, reducing reliance on costly real-world interactions.
Findings
Significantly improves manipulation success rates.
Efficiently injects physical knowledge into VLMs.
Reduces need for large-scale real-world data.
Abstract
Robotic manipulation requires sophisticated commonsense reasoning, a capability naturally possessed by large-scale Vision-Language Models (VLMs). While VLMs show promise as zero-shot planners, their lack of grounded physical understanding often leads to compounding errors and low success rates when deployed in complex real-world environments, particularly for challenging tasks like deformable object manipulation. Although Reinforcement Learning (RL) can adapt these planners to specific task dynamics, directly fine-tuning VLMs via real-world interaction is prohibitively expensive, unsafe, and sample-inefficient. To overcome this bottleneck, we introduce DreamPlan, a novel framework for the reinforcement fine-tuning of VLM planners via video world models. Instead of relying on costly physical rollouts, DreamPlan first leverages the zero-shot VLM to collect exploratory interaction data. We…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics
