EfficientEQA: An Efficient Approach to Open-Vocabulary Embodied Question Answering
Kai Cheng, Zhengyuan Li, Xingpeng Sun, Byung-Cheol Min, Amrit Singh Bedi, Aniket Bera

TL;DR
EfficientEQA is a new framework that enables robots to explore efficiently and answer open-vocabulary questions accurately by combining semantic exploration, adaptive stopping, and retrieval-augmented answer generation.
Contribution
EfficientEQA introduces a novel combination of exploration, stopping, and answer generation techniques tailored to open-vocabulary embodied question answering.
Findings
Achieves over 15% higher answer accuracy than state-of-the-art methods.
Uses over 20% fewer exploration steps.
Effectively combines exploration and answer generation for real-world applicability.
Abstract
Embodied Question Answering (EQA) is an essential yet challenging task for robot assistants. Large vision-language models (VLMs) have shown promise for EQA, but existing approaches either treat it as static video question answering without active exploration or restrict answers to a closed set of choices. These limitations hinder real-world applicability, where a robot must explore efficiently and provide accurate answers in open-vocabulary settings. To overcome these challenges, we introduce EfficientEQA, a novel framework that couples efficient exploration with free-form answer generation. EfficientEQA features three key innovations: (1) Semantic-Value-Weighted Frontier Exploration (SFE) with Verbalized Confidence (VC) from a black-box VLM to prioritize semantically important areas to explore, enabling the agent to gather relevant information faster; (2) a BLIP relevancy-based…
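To illustrate the idea of prioritizing frontiers by semantic value, here is a minimal sketch. It is not the paper's implementation: the `Frontier` class, the `frontier_score` rule (semantic value minus a distance penalty), and the `distance_weight` parameter are all hypothetical names and choices made for illustration. It assumes each frontier already carries a semantic value in [0, 1], for example one derived from a VLM's verbalized confidence.

```python
import math
from dataclasses import dataclass


@dataclass
class Frontier:
    position: tuple[float, float]  # (x, y) of the frontier in map coordinates
    semantic_value: float          # assumed in [0, 1], e.g. from a VLM's verbalized confidence


def frontier_score(frontier, agent_xy, distance_weight=0.1):
    """Hypothetical score: reward semantic value, penalize travel distance."""
    dist = math.dist(frontier.position, agent_xy)
    return frontier.semantic_value - distance_weight * dist


def select_frontier(frontiers, agent_xy, distance_weight=0.1):
    """Return the highest-scoring frontier, or None if there is nothing left to explore."""
    if not frontiers:
        return None
    return max(frontiers, key=lambda f: frontier_score(f, agent_xy, distance_weight))


if __name__ == "__main__":
    candidates = [
        Frontier(position=(1.0, 0.0), semantic_value=0.2),  # close but likely irrelevant
        Frontier(position=(5.0, 0.0), semantic_value=0.9),  # farther but semantically promising
    ]
    print(select_frontier(candidates, agent_xy=(0.0, 0.0)))
```

With a small `distance_weight` the agent prefers the farther, more relevant frontier; raising the weight makes it favor nearby frontiers instead, which is one simple way to trade off relevance against travel cost.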
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling
Methods: BLIP: Bootstrapping Language-Image Pre-training · Focus
