Embodied Instruction Following in Unknown Environments
Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Hang Yin, Yinan Liang, Angyuan Ma, Jiwen Lu, Haibin Yan

TL;DR
This paper introduces a hierarchical embodied instruction following framework enabling agents to explore unknown environments and generate feasible plans for complex human instructions, significantly improving success rates in household tasks.
Contribution
The authors propose a novel hierarchical framework combining high-level planning and low-level exploration using multimodal large language models for unknown environments.
Findings
Achieved 45.09% success rate on complex household instructions.
Effectively integrates scene exploration with task planning.
Demonstrates improved performance over existing methods.
Abstract
Enabling embodied agents to complete complex human instructions from natural language is crucial to autonomous systems in household services. Conventional methods can only accomplish human instructions in the known environment where all interactive objects are provided to the embodied agent, and directly deploying the existing approaches for the unknown environment usually generates infeasible plans that manipulate non-existing objects. On the contrary, we propose an embodied instruction following (EIF) method for complex tasks in the unknown environment, where the agent efficiently explores the unknown environment to generate feasible plans with existing objects to accomplish abstract instructions. Specifically, we build a hierarchical embodied instruction following framework including the high-level task planner and the low-level exploration controller with multimodal large language…
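The explore-then-plan idea in the abstract can be illustrated with a toy loop. This is only a minimal sketch under strong assumptions, not the authors' implementation: the paper uses multimodal large language models for both levels, whereas here the "planner" is a fixed ordered list of required objects and the "exploration controller" simply visits unexplored locations in order. All names (`run_episode`, `hidden_world`, `required_objects`) are hypothetical.

```python
def run_episode(required_objects, hidden_world, max_steps=50):
    """Toy hierarchical loop (illustrative only, not the paper's method).

    required_objects: ordered object names the instruction needs (stand-in
        for a high-level planner's output).
    hidden_world: dict mapping location -> list of objects there; the agent
        cannot read it until it explores that location.
    Returns (actions, success).
    """
    seen = {}                        # object name -> location, built up by exploring
    unvisited = list(hidden_world)   # locations not yet explored
    pending = list(required_objects)
    actions = []

    while pending and len(actions) < max_steps:
        target = pending[0]
        if target in seen:
            # High level: only plan interactions with objects already observed.
            actions.append(("interact", target, seen[target]))
            pending.pop(0)
        elif unvisited:
            # Low level: explore a new location and record what is found.
            loc = unvisited.pop(0)
            for obj in hidden_world[loc]:
                seen.setdefault(obj, loc)
            actions.append(("explore", loc))
        else:
            # Nothing left to explore and the object was never observed:
            # stop rather than emit a step on a non-existent object.
            break

    return actions, not pending


world = {"counter": ["toaster"], "cabinet": ["mug"], "table": ["coffee_machine"]}
actions, success = run_episode(["mug", "coffee_machine"], world)
```

The point of the sketch is the ordering constraint: an interaction step is emitted only after exploration has confirmed the object exists, which is how the toy loop avoids the infeasible plans the abstract attributes to methods that assume a known environment.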
Peer Reviews
Decision: Submitted to ICLR 2025
- The paper is generally written well and easy to follow.
- The paper aims to address a challenging problem of EIF in environments that may change during the agent's task completion.
- The proposed approach achieves strong performance over the baselines.
- Unknown environments are not well defined. An environment is comprised of many elements, such as layouts, textures, object classes and instances, *etc.* It is not clear what is unseen to the agent. In addition, existing EIF benchmarks, like ALFRED, also evaluate agents in "unseen" setups. What is the difference between the unseen environments used here and the existing EIF benchmarks?
- Unlike prior EIF literature, the paper uses depth as additional input, but such depth information is unreli
1. The motivation of the paper is very good, taking into account situations in real environments where items may not be present at the time of planning.
2. The proposed framework is effective and also well-motivated. Combinations between high-level planners and low-level controllers are common, but the feature maps used in this work are interesting.
3. The paper is written in a fluid and well-organized manner.
1. In the main experiment, the authors compare only against LLM-P and FILM; the lack of other baselines weakens the claimed superiority of the proposed approach. Here are three different types of baselines:
(1) I would suggest adding some object navigation methods that use an LLM as the planner, such as MOPA [1] and LGX [2].
(2) Another very similar work, Demand-driven Navigation [3], also uses human instructions as input and has similar tasks (e.g., "I am thirsty" with "I need to drink water") and a similar task motivation.
The use of scene feature maps to facilitate exploration in unknown environments, serving as visual input for VLMs while integrating textual instructions, presents an intriguing, well-founded, and promising approach. This framework combines planning and action seamlessly through the high-level task planner and low-level exploration controller, resulting in a cohesive and efficient system for embodied instruction following.
Experimental Environment Limitations: This method has been validated on a single simulator. It would strengthen the evaluation to include additional simulators, such as Habitat or iGibson, or to test the method on a real-world robotic platform to further demonstrate its robustness.
Limited Scalability and Computational Efficiency: The approach currently shows limited scalability and relatively low computational efficiency, which could impact its practical deployment in larger environments.
Lack
Taxonomy
Topics: Educational Environments and Student Outcomes
Methods: Softmax · Attention Is All You Need
