ThinkBot: Embodied Instruction Following with Thought Chain Reasoning

Guanxing Lu; Ziwei Wang; Changliu Liu; Jiwen Lu; Yansong Tang

arXiv:2312.07062·cs.CV·December 15, 2023·1 cites

TL;DR

ThinkBot enhances embodied instruction following by reasoning through thought chains to recover missing actions, leading to more successful and efficient task completion in complex environments.

Contribution

It introduces a thought chain reasoning approach with an instruction completer and object localizer, improving EIF performance over existing methods.

Findings

1. Outperforms state-of-the-art EIF methods in success rate
2. Achieves higher execution efficiency
3. Effective in complex simulated environments

Abstract

Embodied Instruction Following (EIF) requires agents to complete human instructions by interacting with objects in complicated surrounding environments. Conventional methods generate action plans for agents directly from the sparse human instructions, and they usually fail to achieve human goals because the action descriptions in those instructions are incoherent. In contrast, we propose ThinkBot, which reasons over the thought chain in human instructions to recover the missing action descriptions, so that the agent can successfully complete human goals by following the coherent instructions. Specifically, we first design an instruction completer based on large language models to recover the missing actions and their interacted objects between consecutive human instructions, where the perceived surrounding environment and the completed sub-goals are considered for instruction completion. Based on the…
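
The instruction-completion idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `CompleterContext`, `build_prompt`, `complete_instructions`, and `toy_llm` are hypothetical, the prompt wording is an assumption, and the language model is replaced by a stub so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CompleterContext:
    """Inputs the abstract names for completion (field names are hypothetical)."""
    visible_objects: List[str]      # perceived surrounding environment
    completed_subgoals: List[str]   # sub-goals the agent has already finished


def build_prompt(prev_step: str, next_step: str, ctx: CompleterContext) -> str:
    """Assemble a text prompt; the exact wording is an assumption, not the paper's."""
    return (
        f"Visible objects: {', '.join(ctx.visible_objects) or 'none'}\n"
        f"Completed sub-goals: {', '.join(ctx.completed_subgoals) or 'none'}\n"
        f"Previous instruction: {prev_step}\n"
        f"Next instruction: {next_step}\n"
        "List any missing actions between the two instructions, one per line."
    )


def complete_instructions(
    instructions: List[str],
    ctx: CompleterContext,
    llm: Callable[[str], str],
) -> List[str]:
    """Insert recovered actions between each pair of consecutive instructions."""
    coherent: List[str] = []
    for prev_step, next_step in zip(instructions, instructions[1:]):
        coherent.append(prev_step)
        reply = llm(build_prompt(prev_step, next_step, ctx))
        # Each non-empty line of the reply is treated as one recovered action.
        coherent.extend(line.strip() for line in reply.splitlines() if line.strip())
    if instructions:
        coherent.append(instructions[-1])
    return coherent


def toy_llm(prompt: str) -> str:
    """Stand-in for a real LLM call: recovers one hard-coded missing action."""
    if "Next instruction: Pick up the mug" in prompt:
        return "Open the cabinet"
    return ""
```

Calling `complete_instructions` with the steps "Walk to the kitchen", "Pick up the mug", "Put the mug in the sink" and `toy_llm` places the recovered action "Open the cabinet" before the pick-up step. Passing the model as a plain callable keeps the sketch independent of any particular LLM client.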

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper presents a novel approach to Embodied Instruction Following (EIF) by introducing the concept of a "thought chain" to recover missing action descriptions from sparse human instructions. Integrating an instruction completer and an object localizer enhances the agent's ability to follow instructions in complex environments. This is an effective way to address the coherence issue in action plans.
- The experiments in the paper demonstrate the effectiveness of ThinkBot through extensive t…

Weaknesses

- The introduced approach relies heavily on the inherent reasoning capabilities of LLMs. If the LLMs are not trained on diverse or comprehensive datasets, this could lead to biases or limitations in the agent's performance. The authors need to demonstrate the robustness of the prompt across different LLMs, e.g., Llama 3 and other open-source LLMs (Gemma 2, Qwen2.5).
- While the paper demonstrates strong performance in simulated environments, there might be concerns about how well ThinkBot's approach…

Reviewer 02 · Rating 5 · Confidence 2

Strengths

The proposed pipeline is technically sound. Using an LLM to understand free-form instructions is an effective way to reason about actions in unstructured environments, and the proposed object localizer complements the LLM by addressing its limited spatial understanding.

Weaknesses

1. While effective, this work appears to be a combination of an LLM with a localizer, which might lack novelty. Existing multimodal LLMs (MLLMs) can achieve better reasoning performance than the LLM used here. The authors employed GPT-3.5, but using a state-of-the-art MLLM like GPT-4o could improve performance.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- ThinkBot tries to figure out missing actions from instructions, which is an important and challenging problem in embodied instruction following.
- ThinkBot shows a strong improvement over the previous methods (yet the baseline, Prompter+, on which ThinkBot is built, already beats SoTA; see weaknesses).
- The paper is well structured and clearly presented.

Weaknesses

- It looks like the instruction completer finds where a target object is by addressing missing information in the goal statement, but the step-by-step instructions usually contain the missing information. Why not directly parse the missing information from the step-by-step instructions? Why do we still need the proposed instruction completer?
- Previous work (Prompter) similarly finds the next receptacle by using a language model (BERT) to guess where a target object is likely to be.

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robot Manipulation and Learning