Relation-aware Video Reading Comprehension for Temporal Language Grounding

Jialin Gao; Xin Sun; Mengmeng Xu; Xi Zhou; Bernard Ghanem

arXiv:2110.05717 · cs.CV · December 2, 2021 · 5 cites

TL;DR

This paper introduces RaNet, a relation-aware network for temporal language grounding in videos, reformulating the task as video reading comprehension with enhanced cross-modal and inter-choice relation modeling.

Contribution

It proposes a novel framework that combines coarse-and-fine cross-modal interaction with graph-based relation modeling among video choices, advancing temporal grounding methods.

Findings

01. RaNet is reported to outperform existing methods on the ActivityNet-Captions, TACoS, and Charades-STA datasets.

02. Modeling the relations among moment choices improves grounding accuracy.

03. The code is publicly available for reproducibility.

Abstract

Temporal language grounding in videos aims to localize the temporal span relevant to a given query sentence. Previous methods treat it either as a boundary regression task or as a span extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. The framework selects a video moment choice from a predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor matches the visual and textual information simultaneously at the sentence-moment and token-moment levels, leading to a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor leverages graph convolution to capture the dependencies among video moment choices for the best choice…
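The pipeline outlined in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the RaNet implementation: the function names, the mean-pooled moment features, the element-wise gating, the rule that links choices sharing a start or end boundary, and the random (untrained) weights are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def enumerate_choices(num_clips):
    """Predefined answer set: every span (start, end) with start < end."""
    return [(s, e) for s in range(num_clips) for e in range(s + 1, num_clips + 1)]


def boundary_adjacency(spans):
    """Assumed relation rule: link two choices that share a start or an end."""
    n = len(spans)
    adj = np.zeros((n, n))
    for i, (si, ei) in enumerate(spans):
        for j, (sj, ej) in enumerate(spans):
            if i != j and (si == sj or ei == ej):
                adj[i, j] = 1.0
    return adj


def choice_query_interaction(choices, tokens):
    """choices: (N, d) moment features; tokens: (T, d) word features."""
    # Coarse (sentence-moment): pool tokens into one vector and gate each choice.
    sentence = tokens.max(axis=0)
    coarse = choices * sentence
    # Fine (token-moment): each choice attends over the individual tokens.
    attn = softmax(choices @ tokens.T / np.sqrt(choices.shape[1]), axis=1)
    fine = choices * (attn @ tokens)
    return coarse + fine


def choice_relation_step(feats, adjacency, weight):
    """One graph-convolution step: average over neighbours, linear map, ReLU."""
    adj = adjacency + np.eye(len(feats))          # add self-loops
    adj = adj / adj.sum(axis=1, keepdims=True)    # row-normalise
    return np.maximum(adj @ feats @ weight, 0.0)


# Toy inputs (random stand-ins for real video and text encoders).
rng = np.random.default_rng(0)
num_clips, dim, num_tokens = 4, 8, 5
clips = rng.normal(size=(num_clips, dim))
tokens = rng.normal(size=(num_tokens, dim))

spans = enumerate_choices(num_clips)
choices = np.stack([clips[s:e].mean(axis=0) for s, e in spans])
adjacency = boundary_adjacency(spans)

fused = choice_query_interaction(choices, tokens)
related = choice_relation_step(fused, adjacency, rng.normal(size=(dim, dim)))
scores = related @ rng.normal(size=dim)           # one score per choice
best = spans[int(np.argmax(scores))]              # selected moment (start, end)
print(best)
```

With trained weights in place of the random ones, the highest-scoring choice would be returned as the grounded moment; here the selection only demonstrates the data flow.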

Code & Models

Repositories

Huntersxsx/RaNet
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Convolution