Zero-Shot Temporal Interaction Localization for Egocentric Videos
Erhang Zhang, Junyi Ma, Yin-Dong Zheng, Yixuan Zhou, Hesheng Wang

TL;DR
EgoLoc is a novel zero-shot approach for accurately localizing human-object interaction timings in egocentric videos, leveraging adaptive sampling and feedback refinement to outperform existing methods.
Contribution
The paper introduces EgoLoc, a zero-shot temporal interaction localization (TIL) method for egocentric videos that uses adaptive visual prompts and dynamic feedback to improve accuracy and efficiency.
Findings
EgoLoc outperforms state-of-the-art baselines in localization accuracy.
The method effectively integrates 2D and 3D cues for initial guesses.
Closed-loop feedback enhances localization precision.
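The idea of starting from an initial guess and refining it with closed-loop feedback can be illustrated with a generic coarse-to-fine search. This is a minimal sketch and not the authors' implementation: the summary does not specify the algorithm, so the function `refine_timestamp`, its parameters, and the `score` callback (a stand-in for whatever feedback signal rates a candidate timestamp, such as a VLM judgement) are all hypothetical.

```python
from typing import Callable, List, Tuple


def refine_timestamp(
    score: Callable[[float], float],  # hypothetical feedback: higher is better
    initial_guess: float,             # e.g. a guess derived from 2D/3D cues
    duration: float,                  # video length in seconds
    window: float = 2.0,              # half-width of the first search window
    num_samples: int = 5,             # candidates sampled per round
    rounds: int = 4,                  # number of feedback rounds
    shrink: float = 0.5,              # window shrink factor per round
) -> Tuple[float, List[float]]:
    """Generic coarse-to-fine search around an initial timestamp guess."""
    best = min(max(initial_guess, 0.0), duration)
    history = [best]
    for _ in range(rounds):
        # Sample candidates uniformly in a window centred on the current best.
        lo = max(0.0, best - window)
        hi = min(duration, best + window)
        step = (hi - lo) / (num_samples - 1)
        candidates = [lo + i * step for i in range(num_samples)]
        # Feedback step: keep the candidate the scorer rates highest.
        best = max(candidates, key=score)
        history.append(best)
        # Narrow the window so later rounds sample more densely.
        window *= shrink
    return best, history


if __name__ == "__main__":
    # Toy scorer that peaks at t = 3.7 s; a real system would query a model.
    estimate, trace = refine_timestamp(lambda t: -abs(t - 3.7), 3.0, 10.0)
    print(estimate, trace)
```

Shrinking the window each round lets a fixed budget of scorer calls move from a coarse scan to a fine one, which is one plausible reading of why feedback helps over a single open-loop estimate.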
Abstract
Locating human-object interaction (HOI) actions within video serves as the foundation for multiple downstream tasks, such as human behavior analysis and human-robot skill transfer. Current temporal action localization methods typically rely on annotated action and object categories of interactions for optimization, which leads to domain bias and low deployment efficiency. Although some recent works have achieved zero-shot temporal action localization (ZS-TAL) with large vision-language models (VLMs), their coarse-grained estimations and open-loop pipelines hinder further performance improvements for temporal interaction localization (TIL). To address these issues, we propose a novel zero-shot TIL approach dubbed EgoLoc to locate the timings of grasp actions for human-object interaction in egocentric videos. EgoLoc introduces a self-adaptive sampling strategy to generate reasonable…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Robot Manipulation and Learning
