TL;DR
This paper introduces EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric video question answering, and Hand Intent Tokens (HINT), which help models interpret pointing gestures. The HINT-14B model reaches 68.1% accuracy, surpassing previous models.
Contribution
The authors present the EgoPointVQA dataset and benchmark together with Hand Intent Tokens (HINT), which are derived from 3D hand keypoints and interleaved with the model input so that multimodal models can better interpret pointing gestures in egocentric video.
Findings
HINT-14B achieves 68.1% accuracy, surpassing previous models.
Models augmented with HINT outperform baselines across multiple tasks.
The dataset includes 4000 synthetic and 400 real-world videos for deictic reasoning.
Abstract
Understanding and answering questions based on a user's pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the lack of gesture-rich data and their limited ability to infer fine-grained pointing intent from egocentric video. To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4000 synthetic and 400 real-world videos across multiple deictic reasoning tasks. Built upon it, we further propose Hand Intent Tokens (HINT), which encodes tokens derived from 3D hand keypoints using an off-the-shelf reconstruction model and interleaves them with the model input to provide explicit spatial and temporal context for interpreting pointing intent. We show that our model outperforms others in different backbones…
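The interleaving idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the encoder, the number of tokens per frame, or where they are placed. The sketch assumes one hand token per frame, 21 keypoints per hand, a plain linear projection, and placement of the hand token directly after that frame's visual tokens. The names `encode_hand_token` and `interleave_tokens` are hypothetical.

```python
import random

def encode_hand_token(keypoints, weight, bias):
    """Project one frame's 3D hand keypoints to a single token.

    keypoints: list of K [x, y, z] points (assumed output of a hand
    reconstruction model); weight: D rows of length K*3; bias: length D.
    A linear projection is an assumption made for this sketch.
    """
    flat = [coord for point in keypoints for coord in point]
    return [sum(x * w for x, w in zip(flat, row)) + b
            for row, b in zip(weight, bias)]

def interleave_tokens(visual_tokens, hand_tokens):
    """Append each frame's hand token after that frame's visual tokens.

    visual_tokens: T frames, each a list of N tokens of size D.
    hand_tokens: T tokens of size D.
    Returns a flat sequence of T * (N + 1) tokens.
    """
    sequence = []
    for frame_tokens, hand_token in zip(visual_tokens, hand_tokens):
        sequence.extend(frame_tokens)
        sequence.append(hand_token)
    return sequence

# Illustrative sizes only: frames, visual tokens per frame, keypoints, width.
T, N, K, D = 4, 16, 21, 32
rng = random.Random(0)

keypoints = [[[rng.gauss(0, 1) for _ in range(3)] for _ in range(K)]
             for _ in range(T)]
visual = [[[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)]
          for _ in range(T)]
weight = [[rng.gauss(0, 0.02) for _ in range(K * 3)] for _ in range(D)]
bias = [0.0] * D

hand_tokens = [encode_hand_token(frame, weight, bias) for frame in keypoints]
sequence = interleave_tokens(visual, hand_tokens)
print(len(sequence), len(sequence[0]))  # 68 32
```

Keeping each hand token next to the visual tokens of the same frame is one simple way to preserve the temporal alignment that the abstract says HINT provides. Other placements are equally plausible given the information here.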
