Detecting Precise Hand Touch Moments in Egocentric Video
Huy Anh Nguyen, Feras Dayoub, Minh Hoai

TL;DR
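This paper presents HiCE, a method for detecting the precise moments when hands make contact with objects in egocentric videos, a capability important for AR and HCI. It uses spatiotemporal features from hand regions and their surrounding context, and introduces the TouchMoment dataset.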
Contribution
The HiCE module, which applies cross-attention to features from hand regions and their surrounding context, and the TouchMoment dataset for precise hand contact detection in first-person videos.
Findings
HiCE outperforms state-of-the-art baselines by 16.91% in average precision.
TouchMoment dataset contains over 4,000 videos and 8,456 contact annotations.
The method achieves high accuracy within a two-frame tolerance of contact moments.
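A two-frame tolerance means a predicted contact frame counts as correct when it lies within two frames of an annotated contact moment. The sketch below illustrates one way such a criterion could be scored. The function name `match_within_tolerance`, the greedy one-to-one matching, and the processing of predictions in frame order are assumptions made for illustration; the paper's exact evaluation protocol is not described in this summary.

```python
def match_within_tolerance(pred_frames, gt_frames, tolerance=2):
    """Greedily match predicted contact frames to annotated ones.

    Hypothetical helper, not the paper's evaluation code. Each annotated
    frame can be matched by at most one prediction.
    Returns (true_positives, false_positives, false_negatives).
    """
    matched_gt = set()  # indices of annotations already claimed
    true_positives = 0
    for p in sorted(pred_frames):
        best = None  # (distance, annotation index) of the closest free annotation
        for i, g in enumerate(gt_frames):
            if i in matched_gt:
                continue
            dist = abs(p - g)
            if dist <= tolerance and (best is None or dist < best[0]):
                best = (dist, i)
        if best is not None:
            matched_gt.add(best[1])
            true_positives += 1
    false_positives = len(pred_frames) - true_positives
    false_negatives = len(gt_frames) - true_positives
    return true_positives, false_positives, false_negatives


# Predictions at frames 10 and 31 fall within two frames of annotations
# at 12 and 30; the prediction at 50 has no nearby annotation.
tp, fp, fn = match_within_tolerance([10, 31, 50], [12, 30, 70])
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

Average precision would additionally sweep a confidence threshold over scored predictions; this sketch covers only the matching step at a single threshold.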
Abstract
We address the challenging task of detecting the precise moment when hands make contact with objects in egocentric videos. This frame-level detection is crucial for augmented reality, human-computer interaction, assistive technologies, and robot learning applications, where contact onset signals action initiation or completion. Temporally precise detection is particularly challenging due to subtle hand motion variations near contact, frequent occlusions, fine-grained manipulation patterns, and the inherent motion dynamics of first-person perspectives. To tackle these challenges, we propose a Hand-informed Context Enhanced module (HiCE; pronounced "high-see") that leverages spatiotemporal features from hand regions and their surrounding context through cross-attention mechanisms, learning to identify potential contact patterns. Our approach is further refined with a grasp-aware loss…
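As a rough illustration of the cross-attention idea mentioned in the abstract, the sketch below applies plain scaled dot-product attention in which hand-region feature vectors act as queries and context feature vectors act as keys and values. This is a generic textbook formulation, not the HiCE architecture: the query/key roles, the absence of learned projections, and the names `softmax` and `cross_attention` are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(hand_tokens, context_tokens):
    """Generic scaled dot-product cross-attention (hypothetical sketch).

    hand_tokens:    list of d-dimensional vectors used as queries
    context_tokens: list of d-dimensional vectors used as keys and values
    Returns one context-enhanced vector per hand token.
    """
    d = len(hand_tokens[0])
    outputs = []
    for q in hand_tokens:
        # Similarity of this hand token to every context token.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in context_tokens
        ]
        weights = softmax(scores)
        # Weighted average of the context vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, context_tokens))
            for j in range(d)
        ])
    return outputs


# One hand token attending over two context tokens.
enhanced = cross_attention([[10.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]])
```

In this toy input the hand token aligns with the first context token, so the output is dominated by that token's values. A trained model would normally project queries, keys, and values with learned weight matrices before this step.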
