Object-Centric Framework for Video Moment Retrieval
Zongyao Li, Yongkang Wong, Satoshi Yamazaki, Jianquan Liu, Mohan Kankanhalli

TL;DR
This paper introduces an object-centric framework for video moment retrieval that leverages scene graphs and relational transformers to improve localization accuracy for object-oriented queries, outperforming existing methods.
Contribution
The paper presents a novel object-centric approach using scene graphs and relational transformers to enhance video moment retrieval with detailed object-level reasoning.
Findings
Outperforms state-of-the-art methods on the Charades-STA, QVHighlights, and TACoS benchmarks
Effectively captures object interactions and state changes over time
Improves localization accuracy for object-oriented queries
Abstract
Most existing video moment retrieval methods rely on temporal sequences of frame- or clip-level features that primarily encode global visual and semantic information. However, such representations often fail to capture fine-grained object semantics and appearance, which are crucial for localizing moments described by object-oriented queries involving specific entities and their interactions. In particular, temporal dynamics at the object level have been largely overlooked, limiting the effectiveness of existing approaches in scenarios requiring detailed object-level reasoning. To address this limitation, we propose a novel object-centric framework for moment retrieval. Our method first extracts query-relevant objects using a scene graph parser and then generates scene graphs from video frames to represent these objects and their relationships. Based on the scene graphs, we construct…
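The first two steps named in the abstract (picking out query-relevant objects, then representing them and their relationships as per-frame scene graphs) can be illustrated with a small data-structure sketch. This is not the authors' implementation: the class and function names (Relation, FrameSceneGraph, filter_to_query_objects, object_tracks) are hypothetical, the query objects are written by hand in place of a real scene graph parser, the frame graphs are toy data rather than detector output, and keeping a relation when it touches at least one query object is an assumption made here for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A (subject, predicate, object) triple within one frame (hypothetical type)."""
    subject: str
    predicate: str
    obj: str


@dataclass
class FrameSceneGraph:
    """Objects and relations detected in a single video frame (hypothetical type)."""
    frame_index: int
    objects: set[str] = field(default_factory=set)
    relations: list[Relation] = field(default_factory=list)


def filter_to_query_objects(graph: FrameSceneGraph, query_objects: set[str]) -> FrameSceneGraph:
    """Keep only query-relevant objects and the relations that involve them.

    Assumption: a relation is kept if either endpoint is a query object.
    """
    kept_objects = graph.objects & query_objects
    kept_relations = [
        r for r in graph.relations
        if r.subject in kept_objects or r.obj in kept_objects
    ]
    return FrameSceneGraph(graph.frame_index, kept_objects, kept_relations)


def object_tracks(graphs: list[FrameSceneGraph], query_objects: set[str]) -> dict[str, list[int]]:
    """For each query object, list the frame indices in which it appears."""
    tracks: dict[str, list[int]] = {name: [] for name in sorted(query_objects)}
    for g in graphs:
        for name in g.objects & query_objects:
            tracks[name].append(g.frame_index)
    return tracks


# Hand-written stand-in for parser output on a query like "person opens the door".
query_objects = {"person", "door"}

# Toy per-frame scene graphs standing in for a scene graph generator's output.
graphs = [
    FrameSceneGraph(0, {"person", "door", "sofa"},
                    [Relation("person", "near", "door"),
                     Relation("sofa", "near", "wall")]),
    FrameSceneGraph(1, {"person", "door"},
                    [Relation("person", "opening", "door")]),
    FrameSceneGraph(2, {"sofa"}, []),
]

filtered = [filter_to_query_objects(g, query_objects) for g in graphs]
tracks = object_tracks(filtered, query_objects)
print(tracks)  # {'door': [0, 1], 'person': [0, 1]}
```

The per-object frame lists give a crude view of object-level temporal dynamics: here both query objects are present only in frames 0 and 1, and the predicate between them changes from "near" to "opening". How the paper builds on the scene graphs from this point is not recoverable from the truncated abstract, so the sketch stops here.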
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
