Egocentric Vision Language Planning
Zhirui Fang, Ming Yang, Weishuai Zeng, Boyu Li, Junpeng Yue, Ziluo Ding, Xiu Li, Zongqing Lu

TL;DR
This paper introduces EgoPlan, a novel egocentric vision language planning framework that combines large multimodal models and diffusion models to improve long-horizon task planning and execution in household environments.
Contribution
EgoPlan is the first approach to integrate LMMs with diffusion models for egocentric task planning, enhancing generalization and success in household scenarios.
Findings
EgoPlan outperforms baselines in long-horizon task success rates.
The model effectively integrates visual grounding with symbolic planning.
Generalizes well across different household environments.
Abstract
We explore leveraging large multi-modal models (LMMs) and text2image models to build a more general embodied agent. LMMs excel in planning long-horizon tasks over symbolic abstractions but struggle with grounding in the physical world, often failing to accurately identify object positions in images. A bridge is needed to connect LMMs to the physical world. The paper proposes a novel approach, egocentric vision language planning (EgoPlan), to handle long-horizon tasks from an egocentric perspective in varying household scenarios. This model leverages a diffusion model to simulate the fundamental dynamics between states and actions, integrating techniques like style transfer and optical flow to enhance generalization across different environmental dynamics. The LMM serves as a planner, breaking down instructions into sub-goals and selecting actions based on their alignment with these…
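The planning loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `select_action`, `run_plan`, `predict_next`, `alignment`, `step`, and `reached` are hypothetical, and a toy one-dimensional state stands in for egocentric images. The sketch assumes the dynamics model predicts the next observation for each candidate action, and that the planner takes the action whose prediction scores highest against the current sub-goal under some alignment function.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def select_action(obs, subgoal, actions: Sequence, predict_next: Callable,
                  alignment: Callable) -> Optional[object]:
    """Return the action whose predicted next observation best matches the sub-goal."""
    best_action, best_score = None, float("-inf")
    for action in actions:
        predicted = predict_next(obs, action)   # stand-in for the diffusion dynamics model
        score = alignment(predicted, subgoal)   # stand-in for the planner's alignment check
        if score > best_score:
            best_action, best_score = action, score
    return best_action


def run_plan(obs, subgoals: Sequence, actions: Sequence, predict_next: Callable,
             alignment: Callable, step: Callable, reached: Callable,
             max_steps: int = 20) -> Tuple[object, List]:
    """Pursue each sub-goal in order, acting greedily on predicted outcomes."""
    taken = []
    for subgoal in subgoals:
        for _ in range(max_steps):
            if reached(obs, subgoal):
                break
            action = select_action(obs, subgoal, actions, predict_next, alignment)
            obs = step(obs, action)             # execute in the (toy) environment
            taken.append(action)
    return obs, taken


# Toy assumption: observations are integers and actions are integer offsets.
toy_dynamics = lambda o, a: o + a
toy_alignment = lambda predicted, goal: -abs(predicted - goal)
toy_reached = lambda o, goal: o == goal

final_obs, taken = run_plan(
    obs=0, subgoals=[2, -1], actions=[-1, 0, 1],
    predict_next=toy_dynamics, alignment=toy_alignment,
    step=toy_dynamics, reached=toy_reached,
)
print(final_obs, taken)
```

In this toy setting the predictor and the environment share the same dynamics, so greedy one-step selection reaches every sub-goal. With a learned image-space model the predictions would be approximate, and the alignment score would have to come from the planner rather than from a numeric distance.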
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The framework is well-motivated and reasonable.
2. The data effort will be of good use to future works.
3. The paper is well-organized and easy to read.
4. The proposed method outperforms the baseline.
1. Some crucial ablation studies are missing. How does the framework perform without optical flow and style transfer?
2. Some related works share similar motivations, using diffusion models for world dynamics and dynamics for planning; the authors may consider citing them:
[1] 3D-VLA: A 3D Vision-Language-Action Generative World Model
[2] Diffusion Reward: Learning Rewards via Conditional Video Diffusion
1. The paper is well organized and easy to follow.
2. The idea of using optical flow to generalize across environments is reasonable and novel.
1. The ideas of using a generative model as a world model [1,2,3,4] and an LLM as a task planner [5,6] have been widely studied in previous works.
2. (contd. 1.) The unique contribution of this paper appears to be the use of optical flow to generalize the world model across diverse environments. However, the experimental results are not sufficient to support this claim. Including task execution results rather than solely optical flow error across different simulators could provide more comprehensive eviden
1. Authors perform a number of ablation studies to demonstrate the usefulness of each model component.
2. Authors compare their model to many different baselines.
1. The authors' use of "world model" to describe the paper's diffusion (image editing) component is highly exaggerated. By definition, world models should record and keep track of complete and accurate environment states. However, here the diffusion model is merely LoRA finetuned to edit the provided image in an in-distribution manner. The authors also fail to explain why their diffusion module can remotely constitute a world model in their methods section.
2. The InstructP2P model is known to
1. Innovative Integration of LMMs and Diffusion Models: The paper presents a novel approach by combining LMMs with diffusion models for planning and action prediction in egocentric embodied environments.
2. Incorporation of Computer Vision Techniques: The use of style transfer and optical flow enhances the model’s ability to generalize across different scenes and adapt to spatial changes, which is crucial for embodied agents.
3. Dataset Contribution: The authors have collected the VH-1.5M datase
1. Lack of Planning Instructions and Time Details: The paper does not provide specific planning instructions for the high-level goal decomposition shown in Fig. 2, nor does it mention the duration of the planning process. This omission makes it difficult to evaluate the efficiency and effectiveness of your planning method.
2. Insufficient Details on Diffusion Model Training: There is a lack of detailed information on how the diffusion models (particularly the World Model and the Image Subgoal Ge
Taxonomy
Topics: Religious Tourism and Spaces
Methods: Diffusion
