Object-Centric Framework for Video Moment Retrieval

Zongyao Li; Yongkang Wong; Satoshi Yamazaki; Jianquan Liu; Mohan Kankanhalli

arXiv:2512.18448 · cs.CV · December 23, 2025

TL;DR

This paper introduces an object-centric framework for video moment retrieval that uses scene graphs and relational transformers to improve localization accuracy for object-oriented queries, outperforming existing methods on Charades-STA, QVHighlights, and TACoS.

Contribution

The paper presents a novel object-centric approach using scene graphs and relational transformers to enhance video moment retrieval with detailed object-level reasoning.

Findings

01 Outperforms state-of-the-art on Charades-STA, QVHighlights, TACoS

02 Effectively captures object interactions and state changes over time

03 Improves localization accuracy for object-oriented queries

Abstract

Most existing video moment retrieval methods rely on temporal sequences of frame- or clip-level features that primarily encode global visual and semantic information. However, such representations often fail to capture fine-grained object semantics and appearance, which are crucial for localizing moments described by object-oriented queries involving specific entities and their interactions. In particular, temporal dynamics at the object level have been largely overlooked, limiting the effectiveness of existing approaches in scenarios requiring detailed object-level reasoning. To address this limitation, we propose a novel object-centric framework for moment retrieval. Our method first extracts query-relevant objects using a scene graph parser and then generates scene graphs from video frames to represent these objects and their relationships. Based on the scene graphs, we construct…
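The first two steps named in the abstract (extracting query-relevant objects, then representing those objects and their relationships as per-frame scene graphs) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the `Triple` type, the `query_objects` and `filter_frame_graphs` helpers, the toy vocabulary, and the hand-written frame graphs are all hypothetical, and simple word matching stands in for a real scene graph parser.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One scene-graph edge: subject --predicate--> object (hypothetical type)."""
    subject: str
    predicate: str
    obj: str


def query_objects(query: str, vocabulary: set[str]) -> set[str]:
    """Toy stand-in for a scene graph parser: keep query words that name known objects."""
    words = {w.strip(".,").lower() for w in query.split()}
    return words & vocabulary


def filter_frame_graphs(
    frame_graphs: list[list[Triple]], relevant: set[str]
) -> list[list[Triple]]:
    """Keep, per frame, only the triples that touch at least one query-relevant object."""
    return [
        [t for t in triples if t.subject in relevant or t.obj in relevant]
        for triples in frame_graphs
    ]


# Hypothetical object vocabulary and hand-written per-frame scene graphs.
VOCAB = {"person", "cup", "table", "door", "laptop"}
frames = [
    [Triple("person", "holding", "cup"), Triple("laptop", "on", "table")],
    [Triple("person", "opening", "door")],
    [Triple("cup", "on", "table")],
]

relevant = query_objects("The cup is on the table.", VOCAB)
filtered = filter_frame_graphs(frames, relevant)

for index, triples in enumerate(filtered):
    print(index, [(t.subject, t.predicate, t.obj) for t in triples])
```

In this toy setup the second frame ends up with no triples because it mentions neither "cup" nor "table". A real system would replace the word matching with a learned parser and the hand-written graphs with a scene graph generator run on video frames.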

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization