DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video
Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Basura Fernando, Hongdong Li, and Stephen Gould

TL;DR
This paper introduces DORi, a novel approach for localizing specific moments in videos based on natural language queries by learning a language-conditioned video embedding that captures object, human, and activity relationships.
Contribution
The paper proposes a new message-passing algorithm that models spatial and temporal relationships in videos conditioned on language queries for improved moment localization.
Findings
Outperforms state-of-the-art on three benchmark datasets
Proposes YouCookII as a new benchmark for this task
Effective in capturing complex object-human-activity relationships
Abstract
This paper studies the task of temporal moment localization in a long untrimmed video using a natural language query. Given a query sentence, the goal is to determine the start and end of the relevant segment within the video. Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm suitable for temporal moment localization, which captures the relationships between humans, objects, and activities in the video. These relationships are obtained by a spatial sub-graph that contextualizes the scene representation using detected object and human features conditioned on the language query. Moreover, a temporal sub-graph captures the activities within the video through time. Our method is evaluated on three standard benchmark datasets, and we also introduce YouCookII as a new benchmark for this task. Experiments show our method outperforms…
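The idea of language-conditioned message passing over a spatial sub-graph can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the element-wise query gating, the softmax edge weights, and the averaging update are all assumptions chosen for illustration, and the node features and query embedding are assumed to share one dimension D.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def language_conditioned_message_passing(nodes, query, steps=2):
    """Hypothetical sketch, not the method from the paper.

    nodes: (N, D) features for the object, human, and scene nodes of one
           frame (N >= 2).
    query: (D,) sentence embedding of the natural-language query.
    Returns (N, D) node features contextualized by their neighbours.
    """
    h = np.asarray(nodes, dtype=float).copy()
    q = np.asarray(query, dtype=float)
    for _ in range(steps):
        # Gate node features by the query so edge scores depend on language.
        conditioned = h * q                                # (N, D)
        scores = conditioned @ h.T / np.sqrt(h.shape[1])   # (N, N)
        np.fill_diagonal(scores, -np.inf)                  # no self-messages
        weights = softmax(scores, axis=1)                  # rows sum to 1
        messages = weights @ h                             # aggregate neighbours
        h = 0.5 * h + 0.5 * messages                       # simple averaging update
    return h


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_nodes = rng.normal(size=(5, 8))   # 5 hypothetical nodes, D = 8
    query_embedding = rng.normal(size=8)
    out = language_conditioned_message_passing(frame_nodes, query_embedding)
    print(out.shape)
```

A temporal sub-graph could, under the same assumptions, apply a similar update across per-frame embeddings rather than across nodes within one frame.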
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
