Multi-Step Reasoning for Embodied Question Answering via Tool Augmentation
Mingliang Zhai, Hansheng Liang, Xiaomeng Fan, Zhi Gao, Chuanhao Li, Che Sun, Xu Bin, Yuwei Wu, Yunde Jia

TL;DR
This paper introduces ToolEQA, a multi-step reasoning agent for embodied question answering that uses external tools to improve exploration efficiency and answer accuracy in 3D environments.
Contribution
It presents ToolEQA, an approach that integrates external tools with multi-step reasoning, together with a new large-scale dataset for training and evaluation.
Findings
ToolEQA improves success rate by up to 20.2% over baselines.
It outperforms zero-shot versions by 10% in success rate.
It achieves state-of-the-art results on multiple EQA datasets.
Abstract
Embodied Question Answering (EQA) requires agents to explore 3D environments to obtain observations and answer questions related to the scene. Existing methods leverage VLMs to directly explore the environment and answer questions without explicit thinking or planning, which limits their reasoning ability and results in excessive or inefficient exploration as well as ineffective responses. In this paper, we introduce ToolEQA, an agent that integrates external tools with multi-step reasoning, where external tools can provide more useful information for completing the task, helping the model derive better exploration directions in the next step of reasoning and thus obtaining additional effective information. This enables ToolEQA to generate more accurate responses with a shorter exploration distance. To enhance the model's ability for tool-usage and multi-step reasoning, we further…
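The loop the abstract describes, in which each reasoning step picks a tool and the tool's output informs the next step, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the two-room `SCENE`, the `go_to` and `detect` tools, and the rule-based `policy` that stands in for the VLM are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical toy scene: room name -> objects visible in that room.
SCENE: Dict[str, List[str]] = {
    "hallway": ["shoe rack"],
    "kitchen": ["kettle", "red mug"],
}

@dataclass
class AgentState:
    room: str = "hallway"
    distance: int = 0  # number of room-to-room moves so far
    visited: Set[str] = field(default_factory=lambda: {"hallway"})
    # Each history entry is (thought, tool name, observation).
    history: List[Tuple[str, str, str]] = field(default_factory=list)

def go_to(state: AgentState, room: str) -> str:
    """Hypothetical navigation tool: move to a room at a cost of one step."""
    state.room = room
    state.distance += 1
    state.visited.add(room)
    return f"arrived in {room}"

def detect(state: AgentState, _arg: str) -> str:
    """Hypothetical perception tool: list the objects in the current room."""
    return ", ".join(SCENE[state.room])

TOOLS: Dict[str, Callable[[AgentState, str], str]] = {"go_to": go_to, "detect": detect}

def policy(target: str, state: AgentState) -> Tuple[str, str, str]:
    """Rule-based stand-in for the VLM: returns (thought, tool, argument)."""
    last = state.history[-1] if state.history else None
    if last is None or last[1] == "go_to":
        return ("I have not looked around here yet.", "detect", "")
    if target in last[2]:
        return (f"The detector reports the {target} here.", "answer", state.room)
    unvisited = [room for room in SCENE if room not in state.visited]
    if not unvisited:
        return ("Nothing is left to explore.", "answer", "unknown")
    return ("The target is not here; try another room.", "go_to", unvisited[0])

def run(target: str, max_steps: int = 8) -> Tuple[str, AgentState]:
    """Alternate reasoning and tool calls until the policy answers."""
    state = AgentState()
    for _ in range(max_steps):
        thought, tool, arg = policy(target, state)
        if tool == "answer":
            return arg, state
        observation = TOOLS[tool](state, arg)  # tool output feeds the next step
        state.history.append((thought, tool, observation))
    return "unknown", state

answer, final_state = run("red mug")
print(answer, final_state.distance)
```

In this sketch the detector's output decides whether the agent moves or answers, so the distance travelled depends on what the tools report rather than on a fixed exploration schedule.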
Peer Reviews
Decision: Submitted to ICLR 2026
1. The ToolEQA framework uses explicit reasoning to guide its actions, resulting in shorter navigation paths and more efficient task completion compared to previous methods. 2. The paper contributes EQA-RT, a large-scale dataset with detailed reasoning trajectories. It ingeniously integrates 3D object detection, GPT-4o for question generation, and A* for optimal path planning, all validated by a multi-level verifier, to produce high-quality training data. The automated pipeline used to generate th
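For readers unfamiliar with the A* planner mentioned in the review above, the following is a generic sketch on a small occupancy grid. It is not the paper's pipeline: the grid representation, the 4-connected moves with unit cost, and the Manhattan heuristic are assumptions chosen to keep the example short.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]

def astar(grid: List[List[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Shortest 4-connected path on a grid where 1 marks an obstacle."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell: Cell) -> int:
        # Manhattan distance never overestimates unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap: List[Tuple[int, int, Cell]] = [(heuristic(start), 0, start)]
    came_from: Dict[Cell, Cell] = {}
    best_cost: Dict[Cell, int] = {start: 0}

    while open_heap:
        _, cost, current = heapq.heappop(open_heap)
        if current == goal:
            # Walk the parent links back to the start, then reverse.
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        if cost > best_cost[current]:
            continue  # stale heap entry; a cheaper route was found already
        row, col = current
        for nxt in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
            in_bounds = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if not in_bounds or grid[nxt[0]][nxt[1]] == 1:
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                came_from[nxt] = current
                heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None  # goal unreachable

# A wall across the middle row forces a detour around the right-hand side.
grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = astar(grid, (0, 0), (2, 0))
print(path)
```

A path found this way could serve as a reference trajectory against which an agent's exploration distance is compared, although the excerpt does not say how the paper uses it.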
1. The core ToolEQA framework is conceptually indistinct from established tool-augmented LLM paradigms like ReAct and Toolformer. Its "thought-code-observation" loop is a direct application of this existing work, making the contribution feel more like a conceptual repackaging for the EQA domain rather than a fundamental innovation. The authors should more clearly articulate what unique, non-trivial architectural adaptations were made to the framework specifically for the challenges of embodiment,
The paper focuses on the important problem of using VLM agents to solve embodied QA. Overall, the paper is well written. The main agentic contribution is also novel and seems to provide some potential benefit over baselines.
Questions 4.3 details: I think the most important part of the paper is the dataset curation strategy. While there are sufficient details around how the questions and paths are generated, there are insufficient details around how the reasoning / thought traces were curated. For example, a reasoning trace is also generated when the model/agent should output “Move Left”; how does the thought trace look for this kind of action? When generating the GT data do you provide all the history (with all the
1. The authors explore the relatively understudied extension of tool usage to the application of embodied question answering. 2. Usefulness of generated data: finetuning on the collected dataset improves tool usage and EQA success rates. 3. The paper's ToolEQA agent outperforms the prior method (Fine-EQA) on Fine-EQA's benchmark (EXPRESSBench).
1. The paper has critical gaps in evaluation and analysis that do not fully establish the benefits of tool usage as well as improvements over prior works: a. The authors do not compare to a comparable baseline that follows the same pipeline but replaces the tools with a VLM (e.g., use the same VLM to compare sizes of objects in two frames). It is not clear if the gains are coming from breaking down reasoning into tools (multi-step reasoning) or the use of external tools. b. The evaluation
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
