EVLP: Learning Unified Embodied Vision-Language Planner with Reinforced Supervised Fine-Tuning

Xinyan Cai; Shiguang Wu; Dafeng Chi; Yuzheng Zhuang; Xingyue Quan; Jianye Hao; Qiang Guan

arXiv:2511.05553·cs.CV·November 11, 2025

TL;DR

EVLP introduces a unified multimodal framework that integrates linguistic reasoning and visual generation for long-horizon embodied manipulation tasks, improving task planning through dynamic pretraining and reinforced fine-tuning.

Contribution

The paper presents a novel unified generation framework with dynamic perception pretraining and reinforced fine-tuning for multimodal embodied planning, addressing inconsistencies in prior methods.

Findings

01 Effective multimodal planning for long-horizon tasks achieved

02 Enhanced multimodal correlation through dynamic alignment

03 Improved spatio-visual reasoning in embodied tasks

Abstract

In complex, long-horizon embodied manipulation tasks, effective task decomposition and execution require the synergistic integration of textual logical reasoning and visual-spatial imagination to ensure efficient and accurate operation. Current methods do not adopt a unified generation framework for multimodal planning, which leads to inconsistencies across the planning modalities. To address this challenge, we present EVLP (Embodied Vision-Language Planner), a unified multimodal generation framework that jointly models linguistic reasoning and visual generation. Our approach achieves multimodal planning for long-horizon tasks through a novel training pipeline incorporating dynamic pretraining and reinforced alignment. Our core innovations consist of three key components: 1) Unified Multimodal Generation Framework: For understanding, we integrate semantic information with…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning