EVLP:Learning Unified Embodied Vision-Language Planner with Reinforced Supervised Fine-Tuning
Xinyan Cai, Shiguang Wu, Dafeng Chi, Yuzheng Zhuang, Xingyue Quan, Jianye Hao, Qiang Guan

TL;DR
EVLP introduces a unified multimodal framework that integrates linguistic reasoning and visual generation for long-horizon embodied manipulation tasks, improving task planning through dynamic pretraining and reinforced fine-tuning.
Contribution
The paper presents a novel unified generation framework with dynamic perception pretraining and reinforced fine-tuning for multimodal embodied planning, addressing inconsistencies in prior methods.
Findings
Effective multimodal planning for long-horizon tasks achieved
Enhanced multimodal correlation through dynamic alignment
Improved spatio-visual reasoning in embodied tasks
Abstract
In complex embodied long-horizon manipulation tasks, effective task decomposition and execution require synergistic integration of textual logical reasoning and visual-spatial imagination to ensure efficient and accurate operation. Current methods fail to adopt a unified generation framework for multimodal planning, lead to inconsistent in multimodal planning. To address this challenge, we present \textbf{EVLP (Embodied Vision-Language Planner)}, an innovative multimodal unified generation framework that jointly models linguistic reasoning and visual generation. Our approach achieves multimodal planning for long-horizon tasks through a novel training pipeline incorporating dynamic pretraining and reinforced alignment. Our core innovations consist of three key components: \textbf{1) Unified Multimodal Generation Framework}: For understanding, We integrate semantic information with…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning
