FAST-EQA: Efficient Embodied Question Answering with Global and Local Region Relevancy
Haochen Zhang, Nirav Savaliya, Faizan Siddiqui, Enna Sachdeva

TL;DR
FAST-EQA is a novel framework for embodied question answering that improves efficiency and accuracy by focusing on question-relevant regions, employing a bounded memory, and guiding exploration with global and local cues.
Contribution
It introduces a question-conditioned approach with a bounded scene memory and a global exploration policy to enhance efficiency and robustness in embodied question answering.
Findings
Achieves state-of-the-art performance on HMEQA and EXPRESS-Bench datasets.
Runs substantially faster than prior approaches while maintaining high accuracy.
Effectively handles both single and multi-target questions with bounded memory.
Abstract
Embodied Question Answering (EQA) combines visual scene understanding, goal-directed exploration, and spatial and temporal reasoning under partial observability. A central challenge is to confine physical search to question-relevant subspaces while maintaining a compact, actionable memory of observations. Furthermore, for real-world deployment, fast inference during exploration is crucial. We introduce FAST-EQA, a question-conditioned framework that (i) identifies likely visual targets, (ii) scores global regions of interest to guide navigation, and (iii) employs Chain-of-Thought (CoT) reasoning over visual memory to answer confidently. FAST-EQA maintains a bounded scene memory that stores a fixed-capacity set of region-target hypotheses and updates them online, enabling robust handling of both single and multi-target questions without unbounded growth. To expand coverage efficiently,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Social Robot Interaction and HRI
