Memory-Guided View Refinement for Dynamic Human-in-the-loop EQA

Xin Lu; Rui Li; Xun Huang; Weixin Li; Chuanqing Zhuang; Jiayuan Li; Zhengda Lu; Jun Xiao; Yunhong Wang

arXiv:2603.09541·cs.CV·March 11, 2026

Open Access

TL;DR

This paper introduces DIVRR, a framework for dynamic embodied question answering (EQA) that refines views and selectively manages memory, improving robustness to occlusions and view changes and enabling efficient, accurate reasoning in dynamic scenes.

Contribution

The paper proposes DIVRR, a training-free, relevance-guided view refinement and memory selection framework for dynamic EQA, addressing occlusion ambiguity and evidence management.

Findings

01 DIVRR enhances robustness in dynamic scenes with occlusions.

02 DIVRR maintains high inference efficiency with compact memory.

03 Experiments show consistent improvements over baselines on the DynHiL-EQA and HM-EQA datasets.

Abstract

Embodied Question Answering (EQA) has traditionally been evaluated in temporally stable environments where visual evidence can be accumulated reliably. However, in dynamic, human-populated scenes, human activities and occlusions introduce significant perceptual non-stationarity: task-relevant cues are transient and view-dependent, while a store-then-retrieve strategy over-accumulates redundant evidence and increases inference cost. This setting exposes two practical challenges for EQA agents: resolving ambiguity caused by viewpoint-dependent occlusions, and maintaining compact yet up-to-date evidence for efficient inference. To enable systematic study of this setting, we introduce DynHiL-EQA, a human-in-the-loop EQA dataset with two subsets: a Dynamic subset featuring human activities and temporal changes, and a Static subset with temporally stable observations. To address the above…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning