ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
Guanxing Lu, Ziwei Wang, Changliu Liu, Jiwen Lu, Yansong Tang

TL;DR
ThinkBot enhances embodied instruction following by reasoning through thought chains to recover missing actions, leading to more successful and efficient task completion in complex environments.
Contribution
It introduces a thought chain reasoning approach with an instruction completer and object localizer, improving EIF performance over existing methods.
Findings
Outperforms state-of-the-art EIF methods in success rate
Achieves higher execution efficiency
Effective in complex simulated environments
Abstract
Embodied Instruction Following (EIF) requires agents to complete human instructions by interacting with objects in complicated surrounding environments. Conventional methods directly use the sparse human instructions to generate action plans for agents, which usually fail to achieve human goals because of incoherence in the action descriptions. In contrast, we propose ThinkBot, which reasons about the thought chain in human instructions to recover the missing action descriptions, so that the agent can successfully complete human goals by following the coherent instructions. Specifically, we first design an instruction completer based on large language models to recover the missing actions with interacted objects between consecutive human instructions, where the perceived surrounding environments and the completed sub-goals are considered for instruction completion. Based on the…
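The instruction-completion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `CompletionContext`, the functions `build_completion_prompt` and `parse_missing_actions`, the prompt wording, and the "Action Object" reply format are all hypothetical, chosen only to show how two consecutive instructions, the perceived objects, and the completed sub-goals might be combined into a query for a large language model. No real LLM is called; a hard-coded string stands in for the model reply.

```python
from dataclasses import dataclass, field


@dataclass
class CompletionContext:
    """Inputs the abstract says are considered (field names are hypothetical)."""
    previous_instruction: str
    next_instruction: str
    perceived_objects: list[str] = field(default_factory=list)
    completed_subgoals: list[str] = field(default_factory=list)


def build_completion_prompt(ctx: CompletionContext) -> str:
    """Assemble a text prompt asking an LLM for the missing steps.

    The wording is an assumption for illustration, not the paper's prompt.
    """
    objects = ", ".join(ctx.perceived_objects) or "none observed"
    subgoals = "; ".join(ctx.completed_subgoals) or "none"
    return (
        "You are helping a household agent follow instructions.\n"
        f"Objects perceived so far: {objects}\n"
        f"Sub-goals already completed: {subgoals}\n"
        f"Previous instruction: {ctx.previous_instruction}\n"
        f"Next instruction: {ctx.next_instruction}\n"
        "List any missing actions needed between the two instructions, "
        "one per line, formatted as 'Action Object'."
    )


def parse_missing_actions(llm_reply: str) -> list[tuple[str, str]]:
    """Parse reply lines of the form 'Action Object' into (action, object) pairs.

    Blank lines and lines with fewer than two tokens are skipped.
    """
    pairs = []
    for line in llm_reply.splitlines():
        parts = line.strip().split(maxsplit=1)
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))
    return pairs


if __name__ == "__main__":
    ctx = CompletionContext(
        previous_instruction="Walk to the fridge.",
        next_instruction="Put the apple on the table.",
        perceived_objects=["Fridge", "Table"],
        completed_subgoals=["Navigate Fridge"],
    )
    print(build_completion_prompt(ctx))
    # Stand-in for a model reply; a real system would query an LLM here.
    print(parse_missing_actions("Open Fridge\nPickUp Apple"))
```

In this example the sparse instructions never say to open the fridge or pick up the apple; those are the kind of missing actions with interacted objects that the completer is meant to recover.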
Peer Reviews
Decision: ICLR 2025 Poster
- The paper presents a novel approach to Embodied Instruction Following (EIF) by introducing the concept of a "thought chain" to recover missing action descriptions from sparse human instructions. Integrating an instruction completer and an object localizer enhances the agent's ability to follow instructions in complex environments. This is an effective way to address the coherence issue in action plans.
- The experiments in the paper demonstrate the effectiveness of ThinkBot through extensive t
- The introduced approach relies heavily on the inherent reasoning capabilities of LLMs. If the LLMs are not trained on diverse or comprehensive datasets, this could lead to biases or limitations in the agent's performance. The authors need to demonstrate the robustness of the prompt across different LLMs, e.g., Llama 3 and other open-source LLMs (gemma2, qwen2.5).
- While the paper demonstrates strong performance in simulated environments, there might be concerns about how well ThinkBot's approach
The proposed pipeline is technically sound. Using an LLM to understand free-form instructions is an effective way to reason about actions in unstructured environments. The proposed object localizer complements the LLM by addressing its limitations in spatial understanding.
1. While effective, this work appears to be a combination of an LLM with a localizer, which might lack novelty. Existing multimodal LLMs (MLLMs) can achieve better reasoning performance than the LLM used here. The authors employed GPT-3.5, but using a state-of-the-art MLLM like GPT-4o could improve performance.
- ThinkBot tries to figure out missing actions from instructions, which is an important and challenging problem of embodied instruction following.
- ThinkBot shows a strong improvement over the previous methods (yet the baseline, Prompter+, on which ThinkBot is built already beats SoTA, see weaknesses).
- The paper is well structured and shows clear presentation.
- It looks like the instruction completer finds where a target object is by addressing missing information in the goal statement, but the step-by-step instructions usually contain the missing information. Why not directly parse the missing information from the step-by-step instructions? Why is the proposed instruction completer still needed?
- Previous work (Prompter) similarly finds the next receptacle by using a language model (BERT) to guess where a target object is likely to be.
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robot Manipulation and Learning
