Object-Shot Enhanced Grounding Network for Egocentric Video

Yisen Feng; Haoyu Zhang; Meng Liu; Weili Guan; Liqiang Nie

arXiv:2505.04270 · cs.CV · May 8, 2025

TL;DR

This paper introduces OSGNet, a novel egocentric video grounding model that leverages object information and shot movement analysis to improve modality alignment and achieve state-of-the-art results.

Contribution

The paper proposes a new method that incorporates object extraction and shot movement analysis to enhance egocentric video grounding performance.

Findings

01. Achieves state-of-the-art results on three datasets.
02. Effectively utilizes object information for better grounding.
03. Leverages shot movement features to model wearer attention.

Abstract

Egocentric video grounding is a crucial task for embodied intelligence applications, distinct from exocentric video moment localization. Existing methods primarily focus on the distributional differences between egocentric and exocentric videos but often neglect key characteristics of egocentric videos and the fine-grained information emphasized by question-type queries. To address these limitations, we propose OSGNet, an Object-Shot enhanced Grounding Network for egocentric video. Specifically, we extract object information from videos to enrich video representation, particularly for objects highlighted in the textual query but not directly captured in the video features. Additionally, we analyze the frequent shot movements inherent to egocentric videos, leveraging these features to extract the wearer's attention information, which enhances the model's ability to perform modality…

Code & Models

Repositories

yisen-feng/osgnet
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis

Methods: Softmax · Attention Is All You Need · Focus