EfficientEQA: An Efficient Approach to Open-Vocabulary Embodied Question Answering

Kai Cheng; Zhengyuan Li; Xingpeng Sun; Byung-Cheol Min; Amrit Singh Bedi; Aniket Bera

arXiv:2410.20263 · cs.RO · August 12, 2025

TL;DR

EfficientEQA is a new framework that enables robots to explore efficiently and answer open-vocabulary questions accurately by combining semantic exploration, adaptive stopping, and retrieval-augmented answer generation.

Contribution

EfficientEQA introduces a combination of exploration, stopping, and answer-generation techniques tailored to open-vocabulary embodied question answering.

Findings

01. Achieves over 15% higher answer accuracy than state-of-the-art methods.

02. Uses over 20% fewer exploration steps than state-of-the-art methods.

03. Combines exploration and answer generation in a single framework aimed at real-world applicability.

Abstract

Embodied Question Answering (EQA) is an essential yet challenging task for robot assistants. Large vision-language models (VLMs) have shown promise for EQA, but existing approaches either treat it as static video question answering without active exploration or restrict answers to a closed set of choices. These limitations hinder real-world applicability, where a robot must explore efficiently and provide accurate answers in open-vocabulary settings. To overcome these challenges, we introduce EfficientEQA, a novel framework that couples efficient exploration with free-form answer generation. EfficientEQA features three key innovations: (1) Semantic-Value-Weighted Frontier Exploration (SFE) with Verbalized Confidence (VC) from a black-box VLM to prioritize semantically important areas to explore, enabling the agent to gather relevant information faster; (2) a BLIP relevancy-based…
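
To make the first innovation concrete, here is a minimal sketch of semantic-value-weighted frontier selection. It is not the authors' implementation: the abstract does not give the scoring formula, so the `Frontier` class, the `score` function, and the `distance_weight` parameter are all hypothetical. The sketch assumes each frontier already carries a semantic value in [0, 1], obtained by asking a VLM to verbalize its confidence that exploring there helps answer the question, and that the agent trades that value off against travel distance.

```python
import math
from dataclasses import dataclass


@dataclass
class Frontier:
    """A candidate exploration target on the map (hypothetical structure)."""
    x: float
    y: float
    # Assumption: a VLM's verbalized confidence, already parsed to [0, 1].
    semantic_value: float


def score(frontier, agent_xy, distance_weight=0.1):
    """Semantic value minus a distance penalty (an assumed scoring rule)."""
    dist = math.hypot(frontier.x - agent_xy[0], frontier.y - agent_xy[1])
    return frontier.semantic_value - distance_weight * dist


def pick_frontier(frontiers, agent_xy, distance_weight=0.1):
    """Return the highest-scoring frontier, or None if none remain."""
    if not frontiers:
        return None
    return max(frontiers, key=lambda f: score(f, agent_xy, distance_weight))


if __name__ == "__main__":
    candidates = [Frontier(1.0, 0.0, 0.2), Frontier(5.0, 0.0, 0.9)]
    best = pick_frontier(candidates, agent_xy=(0.0, 0.0))
    print(best)
```

With a small distance penalty the agent prefers the far but semantically promising frontier. Raising `distance_weight` shifts the choice toward nearby frontiers, which is the trade-off any value-weighted frontier rule has to settle.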

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling

Methods: BLIP: Bootstrapping Language-Image Pre-training · Focus