Relation-aware Video Reading Comprehension for Temporal Language Grounding
Jialin Gao, Xin Sun, Mengmeng Xu, Xi Zhou, Bernard Ghanem

TL;DR
This paper introduces RaNet, a relation-aware network for temporal language grounding in videos, reformulating the task as video reading comprehension with enhanced cross-modal and inter-choice relation modeling.
Contribution
It proposes a novel framework that combines coarse-and-fine cross-modal interaction with graph-based relation modeling among video choices, advancing temporal grounding methods.
Findings
Outperforms existing methods on ActivityNet-Captions, TACoS, and Charades-STA datasets.
Effective modeling of choice relations improves grounding accuracy.
Code is publicly available for reproducibility.
Abstract
Temporal language grounding in videos aims to localize the temporal span relevant to a given query sentence. Previous methods treat it either as a boundary regression task or as a span extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. The framework aims to select a video moment choice from a predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match the visual and textual information simultaneously at the sentence-moment and token-moment levels, leading to a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor is introduced, leveraging graph convolution to capture the dependencies among video moment choices for the best choice…
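The selection pipeline described in the abstract can be illustrated with a small, parameter-free sketch. It is not the authors' implementation: the function names, the mean-pooled moment features, the dot-product matching, and the rule that two choices are neighbors when they share a start or end clip are all assumptions made for illustration, and plain score averaging stands in for a learned graph convolution.

```python
import math

def enumerate_choices(num_clips):
    """Predefined answer set: every (start, end) clip span with start <= end."""
    return [(s, e) for s in range(num_clips) for e in range(s, num_clips)]

def mean_pool(vectors):
    """Average a list of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [v / total for v in exps]

def coarse_fine_score(moment, sentence, tokens):
    """Coarse (sentence-moment) plus fine (token-moment) matching score."""
    coarse = dot(moment, sentence)
    # Fine level: attend over the query tokens using the moment as the key.
    weights = softmax([dot(moment, t) for t in tokens])
    attended = [sum(w * t[d] for w, t in zip(weights, tokens))
                for d in range(len(moment))]
    return coarse + dot(moment, attended)

def propagate(scores, choices):
    """Stand-in for graph convolution: average each score over its neighbors.

    Assumption: two choices are neighbors if they share a start or an end
    clip (each choice is therefore also its own neighbor).
    """
    out = []
    for s_i, e_i in choices:
        nbrs = [scores[j] for j, (s_j, e_j) in enumerate(choices)
                if s_j == s_i or e_j == e_i]
        out.append(sum(nbrs) / len(nbrs))
    return out

def select_moment(clips, sentence, tokens):
    """Score every choice, refine with choice-choice relations, pick the best."""
    choices = enumerate_choices(len(clips))
    moments = [mean_pool(clips[s:e + 1]) for s, e in choices]
    raw = [coarse_fine_score(m, sentence, tokens) for m in moments]
    refined = propagate(raw, choices)
    best = max(range(len(choices)), key=refined.__getitem__)
    return choices[best], raw, refined

# Toy input (hypothetical features): clips 1 and 2 match the query, 0 and 3 do not.
clips = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
sentence = [1.0, 0.0]
tokens = [[1.0, 0.0], [0.0, 1.0]]
best, raw, refined = select_moment(clips, sentence, tokens)
print(best)
```

On this toy input the raw scores of (1, 1), (1, 2), and (2, 2) tie, because all three spans pool to the same feature. After one round of neighbor averaging the span (1, 2) ranks highest, since more of its neighbors also score well. This is the intuition behind modeling relations among choices, shown here without any learned weights.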
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Convolution
