TL;DR
This paper introduces EgoPoint-Bench, a benchmark for evaluating and improving multimodal pointing reasoning in egocentric vision. It targets the failure of current models to ground the spatial semantics of pointing gestures.
Contribution
The paper presents a question-answering benchmark of over 11k simulated and real-world samples, and shows that fine-tuning models on the synthetic data improves egocentric pointing performance and generalization.
Findings
State-of-the-art proprietary and open-source models struggle with egocentric pointing.
Fine-tuning on synthetic data improves performance and generalization.
The benchmark spans five evaluation dimensions and three levels of referential complexity.
Abstract
Egocentric AI agents, such as smart glasses, rely on pointing gestures to resolve referential ambiguities in natural language commands. However, despite advancements in Multimodal Large Language Models (MLLMs), current systems often fail to precisely ground the spatial semantics of pointing. Instead, they rely on spurious correlations with visual proximity or object saliency, a phenomenon we term "Referential Hallucination." To address this gap, we introduce EgoPoint-Bench, a comprehensive question-answering benchmark designed to evaluate and enhance multimodal pointing reasoning in egocentric views. Comprising over 11k high-fidelity simulated and real-world samples, the benchmark spans five evaluation dimensions and three levels of referential complexity. Extensive experiments demonstrate that while state-of-the-art proprietary and open-source models struggle with egocentric pointing,…
