TL;DR
This paper introduces a Hierarchical Deep Residual Reasoning model for more precise temporal moment localization in videos, leveraging multi-level semantic representations and adaptive feature fusion.
Contribution
The paper presents a novel hierarchical reasoning framework that decomposes video and sentence semantics for finer localization, and introduces Res-BiGRUs for adaptive feature fusion.
Findings
The proposed HDRR model outperforms state-of-the-art methods on the Charades-STA and ActivityNet-Captions datasets.
Hierarchical semantic decomposition improves localization accuracy.
Res-BiGRUs effectively handle varying video resolutions and sentence lengths.
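To make "localization accuracy" concrete, here is a minimal sketch of temporal intersection-over-union (IoU) between a predicted moment and a ground-truth moment. This summary does not say which metric the paper reports, so temporal IoU is an assumption here, chosen because it is a common way to score moment localization. The function name `temporal_iou` and the `(start, end)` tuple layout in seconds are hypothetical choices made for illustration.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) intervals given in seconds.

    Assumption: this is an illustrative metric, not necessarily the exact
    evaluation protocol used by the paper.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt

    # Overlap length is zero when the two intervals do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))

    # Union is the sum of both lengths minus the shared part.
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection

    # Guard against degenerate zero-length intervals.
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Predicted moment 2s-8s against a ground-truth moment 4s-10s.
    print(temporal_iou((2.0, 8.0), (4.0, 10.0)))
```

Under this sketch, a prediction would count as correct when its IoU with the annotated moment exceeds a chosen threshold; the threshold is likewise an assumption and is not stated in this summary.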
Abstract
Temporal Moment Localization (TML) in untrimmed videos is a challenging multimedia task that aims to localize the start and end points of the activity in a video described by a sentence query. Existing methods mainly focus on mining the correlation between video and sentence representations or on investigating how the two modalities are fused. These works understand the video and sentence only coarsely, ignoring the fact that a sentence can be understood at various semantic levels and that the dominant words affecting moment localization are the action and object references. To this end, we propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics to achieve finer-grained localization. Furthermore, considering that videos with different…
