Beyond Language: Grounding Referring Expressions with Hand Pointing in Egocentric Vision
Ling Li, Bowen Liu, Zinuo Zhan, Peng Jie, Jianhui Zhong, Kenglun Chang, Zhidong Deng

TL;DR
This paper introduces EgoPoint-Ground, a large-scale egocentric dataset for multimodal visual grounding using hand pointing, and proposes SV-CoT, a new inference framework that improves grounding accuracy by integrating gestural and linguistic cues.
Contribution
The paper presents the first egocentric deictic visual grounding dataset and a novel structured inference method that enhances multimodal grounding performance.
Findings
SV-CoT achieves an 11.7% absolute improvement over existing methods.
The dataset includes over 15,000 samples with rich annotations.
Extensive experiments validate the effectiveness of the proposed approach.
Abstract
Traditional Visual Grounding (VG) predominantly relies on textual descriptions to localize objects, a paradigm that inherently struggles with linguistic ambiguity and often ignores non-verbal deictic cues prevalent in real-world interactions. In natural egocentric engagements, hand-pointing combined with speech forms the most intuitive referring mechanism. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale multimodal dataset dedicated to egocentric deictic visual grounding. Comprising over 15k interactive samples in complex scenes, the dataset provides rich, multi-grained annotations including hand-target bounding box pairs and dense semantic captions. We establish a comprehensive benchmark for hand-pointing referring expression resolution, evaluating a wide spectrum of mainstream Multimodal Large Language Models (MLLMs) and state-of-the-art VG…
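To make the annotation format described above more concrete, here is a minimal Python sketch of what one sample with a hand-target bounding box pair and a caption might look like, along with an intersection-over-union (IoU) check. The class name `PointingSample`, its field names, the `(x_min, y_min, x_max, y_max)` box layout, and the 0.5 threshold are all illustrative assumptions. The abstract specifies neither the dataset schema nor the evaluation metric; IoU is used here only because it is a common way to score grounding predictions.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class PointingSample:
    """Hypothetical record; field names are not taken from the dataset."""
    image_path: str
    hand_box: Box      # box around the pointing hand
    target_box: Box    # box around the referred object
    caption: str       # semantic caption for the target


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_hit(sample: PointingSample, predicted: Box, threshold: float = 0.5) -> bool:
    """Count a prediction as correct if it overlaps the target enough.

    The 0.5 threshold is an assumption, not a value from the paper.
    """
    return iou(sample.target_box, predicted) >= threshold


# Made-up example values, for illustration only.
sample = PointingSample(
    image_path="frame_0001.jpg",
    hand_box=(300.0, 400.0, 380.0, 480.0),
    target_box=(0.0, 0.0, 10.0, 10.0),
    caption="the red mug on the left",
)
print(is_hit(sample, (0.0, 0.0, 10.0, 10.0)))
print(is_hit(sample, (20.0, 20.0, 30.0, 30.0)))
```

Keeping the hand box and the target box in the same record lets a model or an evaluation script relate the pointing gesture to the referred object without a separate lookup.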
