Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering

Yura Choi; Roy Miles; Rolandos Alexandros Potamias; Ismail Elezi; Jiankang Deng; Stefanos Zafeiriou

arXiv:2603.12533 · cs.CV · March 31, 2026

TL;DR

This paper introduces EgoPointVQA, a dataset and benchmark for gesture-based egocentric video question answering, and Hand Intent Tokens (HINT), which improve a model's understanding of pointing gestures. HINT-14B reaches 68.1% accuracy, surpassing previous models.

Contribution

The authors present the EgoPointVQA dataset and benchmark together with Hand Intent Tokens (HINT), which help multimodal models interpret pointing gestures in egocentric video.

Findings

01 HINT-14B achieves 68.1% accuracy, surpassing previous models.

02 Models with HINT tokens outperform baselines across multiple tasks.

03 The dataset includes 4000 synthetic and 400 real-world videos for deictic reasoning.

Abstract

Understanding and answering questions based on a user's pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the lack of gesture-rich data and their limited ability to infer fine-grained pointing intent from egocentric video. To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4000 synthetic and 400 real-world videos across multiple deictic reasoning tasks. Built upon it, we further propose Hand Intent Tokens (HINT), which encodes tokens derived from 3D hand keypoints using an off-the-shelf reconstruction model and interleaves them with the model input to provide explicit spatial and temporal context for interpreting pointing intent. We show that our model outperforms others in different backbones…
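The interleaving idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 21-joint hand skeleton, the single linear projection, the toy embedding width, the placement of the hand token after each frame's visual tokens, and all function names (`make_projection`, `hand_token`, `interleave_tokens`) are assumptions made only for illustration.

```python
import random
from typing import List, Optional

Vector = List[float]

NUM_KEYPOINTS = 21  # assumption: a 21-joint hand skeleton
EMBED_DIM = 8       # assumption: toy embedding width, far smaller than a real model's


def make_projection(in_dim: int, out_dim: int, seed: int = 0) -> List[Vector]:
    """Build a random (out_dim x in_dim) weight matrix standing in for a learned projection."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def hand_token(keypoints: List[Vector], weights: List[Vector]) -> Vector:
    """Flatten one frame's 3D keypoints (NUM_KEYPOINTS x 3) and project them to one token."""
    flat = [coord for joint in keypoints for coord in joint]
    return [sum(w * x for w, x in zip(row, flat)) for row in weights]


def interleave_tokens(
    frame_tokens: List[List[Vector]],
    hand_tokens: List[Optional[Vector]],
) -> List[Vector]:
    """Append each frame's hand token (if a hand was detected) after that frame's visual tokens."""
    sequence: List[Vector] = []
    for visual, hand in zip(frame_tokens, hand_tokens):
        sequence.extend(visual)
        if hand is not None:  # frames without a detected hand contribute no hand token
            sequence.append(hand)
    return sequence
```

In this sketch, the hand token sits next to the visual tokens of the frame it came from, so the sequence carries both where the hand is (spatial context) and when it was there (temporal context). How the paper actually orders, encodes, and positions these tokens may differ.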
