Zero-Shot Temporal Interaction Localization for Egocentric Videos

Erhang Zhang; Junyi Ma; Yin-Dong Zheng; Yixuan Zhou; Hesheng Wang

arXiv:2506.03662·cs.CV·November 17, 2025

TL;DR

EgoLoc is a novel zero-shot approach for accurately localizing human-object interaction timings in egocentric videos, leveraging adaptive sampling and feedback refinement to outperform existing methods.

Contribution

The paper introduces EgoLoc, a zero-shot temporal interaction localization (TIL) method that uses adaptive visual prompts and dynamic feedback to improve accuracy and efficiency in egocentric videos.

Findings

1. EgoLoc outperforms state-of-the-art baselines in localization accuracy.

2. The method effectively integrates 2D and 3D cues for initial guesses.

3. Closed-loop feedback enhances localization precision.

Abstract

Locating human-object interaction (HOI) actions within video serves as the foundation for multiple downstream tasks, such as human behavior analysis and human-robot skill transfer. Current temporal action localization methods typically rely on annotated action and object categories of interactions for optimization, which leads to domain bias and low deployment efficiency. Although some recent works have achieved zero-shot temporal action localization (ZS-TAL) with large vision-language models (VLMs), their coarse-grained estimations and open-loop pipelines hinder further performance improvements for temporal interaction localization (TIL). To address these issues, we propose a novel zero-shot TIL approach dubbed EgoLoc to locate the timings of grasp actions for human-object interaction in egocentric videos. EgoLoc introduces a self-adaptive sampling strategy to generate reasonable…

Code & Models

Repositories

irmvlab/egoloc
PyTorch · Official

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Robot Manipulation and Learning