EgoLoc: A Generalizable Solution for Temporal Interaction Localization in Egocentric Videos
Junyi Ma, Erhang Zhang, Yin-Dong Zheng, Yuchen Xie, Yixuan Zhou, Hesheng Wang

TL;DR
EgoLoc is a zero-shot method for precisely localizing hand-object contact and separation moments in egocentric videos, enhancing interaction understanding without requiring object masks or category annotations.
Contribution
The paper introduces EgoLoc, a novel zero-shot approach that leverages hand dynamics and vision-language models for accurate temporal interaction localization in egocentric videos.
Findings
EgoLoc achieves accurate contact/separation localization in egocentric videos.
The method generalizes well without object masks or category labels.
EgoLoc improves downstream tasks in egocentric vision and robotics.
Abstract
Analyzing hand-object interaction in egocentric vision facilitates VR/AR applications and human-robot policy transfer. Existing research has mostly focused on modeling the behavior paradigm of interactive actions (i.e., "how to interact"). However, the more challenging and fine-grained problem of capturing the critical moments of contact and separation between the hand and the target object (i.e., "when to interact") remains underexplored, even though it is crucial for immersive interactive experiences in mixed reality and for robotic motion planning. Therefore, we formulate this problem as temporal interaction localization (TIL). Some recent works extract semantic masks as TIL references, but suffer from inaccurate object grounding and cluttered scenarios. Although current temporal action localization (TAL) methods perform well in detecting verb-noun action segments, they rely on category…
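To make the TIL task output concrete, here is a minimal Python sketch of what "contact" and "separation" moments look like as frame indices. It is an illustration of the problem formulation only, not EgoLoc's method: it assumes a per-frame boolean contact signal is already available, which is exactly the quantity a TIL method has to estimate. The names `contact_separation_frames`, `frame_to_seconds`, and the example flags are hypothetical.

```python
from typing import List, Tuple


def contact_separation_frames(in_contact: List[bool]) -> List[Tuple[str, int]]:
    """Return ("contact", i) and ("separation", i) events at state changes.

    Assumption (not from the paper): in_contact[i] is True when the hand
    touches the target object in frame i, and the hand starts out of contact.
    """
    events: List[Tuple[str, int]] = []
    previous = False
    for i, current in enumerate(in_contact):
        if current and not previous:
            events.append(("contact", i))      # first frame with contact
        elif previous and not current:
            events.append(("separation", i))   # first frame without contact
        previous = current
    return events


def frame_to_seconds(frame_index: int, fps: float) -> float:
    """Convert a frame index to a timestamp, assuming a constant frame rate."""
    return frame_index / fps


# Hypothetical 7-frame clip: contact holds during frames 2 to 4.
flags = [False, False, True, True, True, False, False]
events = contact_separation_frames(flags)
```

In this toy clip the function reports a contact event at frame 2 and a separation event at frame 5. The difficulty the abstract points to lies upstream of this step: obtaining a reliable contact signal without object masks or category labels.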
