Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision

Chentao Li, Zirui Gao, Mingze Gao, Yinglian Ren, Jianjiang Feng, and Jie Zhou

arXiv:2604.21461 · cs.CV · April 24, 2026

TL;DR

This paper introduces EgoPoint-Bench, a question-answering benchmark for evaluating and improving how Multimodal Large Language Models (MLLMs) reason about pointing gestures in egocentric vision. It targets a specific weakness of current models: they often fail to ground the spatial semantics of a pointing gesture.

Contribution

The paper presents a benchmark of over 11k simulated and real-world samples and shows that fine-tuning models on the synthetic data improves egocentric pointing performance and generalization.

Findings

01. State-of-the-art proprietary and open-source models struggle with egocentric pointing.

02. Fine-tuning on synthetic data improves performance and generalization.

03. The benchmark spans five evaluation dimensions and three levels of referential complexity.

Abstract

Egocentric AI agents, such as smart glasses, rely on pointing gestures to resolve referential ambiguities in natural language commands. However, despite advancements in Multimodal Large Language Models (MLLMs), current systems often fail to precisely ground the spatial semantics of pointing. Instead, they rely on spurious correlations with visual proximity or object saliency, a phenomenon we term "Referential Hallucination." To address this gap, we introduce EgoPoint-Bench, a comprehensive question-answering benchmark designed to evaluate and enhance multimodal pointing reasoning in egocentric views. Comprising over 11k high-fidelity simulated and real-world samples, the benchmark spans five evaluation dimensions and three levels of referential complexity. Extensive experiments demonstrate that while state-of-the-art proprietary and open-source models struggle with egocentric pointing,…

Code & Models

Repositories

https://guyyyug.github.io/EgoPoint-Bench
