Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos

Yijun Song; Jingwen Wang; Lin Ma; Zhou Yu; Jun Yu

arXiv:2003.07048 · cs.CV · March 17, 2020 · 51 citations

Open Access

TL;DR

This paper introduces MARN, a weakly-supervised model that localizes the video segment matching a textual query. It is trained only on video-sentence pairs, with no temporal annotations, and scores candidate segments using learned attention.

Contribution

The paper proposes a novel multi-level attentional reconstruction network that leverages intra- and inter-proposal information for weakly-supervised video grounding, improving over existing methods.

Findings

01. MARN outperforms existing weakly-supervised methods on the Charades-STA and ActivityNet-Captions datasets.

02. The proposal sampling mechanism and the attention mechanisms improve proposal representations and localization accuracy.

03. The model uses intra- and inter-proposal cues to improve grounding performance.

Abstract

The task of temporally grounding textual queries in videos is to localize one video segment that semantically corresponds to the given query. Most existing approaches rely on segment-sentence pairs (temporal annotations) for training, which are usually unavailable in real-world scenarios. In this work we present an effective weakly-supervised model, named Multi-Level Attentional Reconstruction Network (MARN), which relies only on video-sentence pairs during the training stage. The proposed method leverages the idea of attentional reconstruction and directly scores the candidate segments with the learnt proposal-level attentions. Moreover, another branch learning clip-level attention is exploited to refine the proposals at both the training and testing stages. We develop a novel proposal sampling mechanism to leverage intra-proposal information for learning better proposal…
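The idea of scoring candidate segments with proposal-level attention can be illustrated with a toy sketch. This is not the MARN implementation: the dot-product scoring, the two-dimensional features, and the names `score_proposals` and `attended_feature` are all assumptions made for illustration. It only shows how attention weights over proposals can double as localization scores, and how an attention-weighted feature could be formed for a downstream reconstruction step.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def score_proposals(proposal_feats, query_feat):
    """Hypothetical scorer: dot-product attention of one query over proposals."""
    logits = [sum(p * q for p, q in zip(feat, query_feat))
              for feat in proposal_feats]
    return softmax(logits)

def attended_feature(proposal_feats, attn):
    """Attention-weighted sum of proposal features (illustrative input
    to a module that would try to reconstruct the query)."""
    dim = len(proposal_feats[0])
    return [sum(a * feat[d] for a, feat in zip(attn, proposal_feats))
            for d in range(dim)]

# Toy data (made up): three candidate segments as (start, end) in seconds.
proposals = [(0, 4), (4, 8), (8, 12)]
proposal_feats = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.4]]
query_feat = [1.0, 0.0]

attn = score_proposals(proposal_feats, query_feat)
pooled = attended_feature(proposal_feats, attn)

# At test time the attention weights themselves rank the proposals.
best = max(range(len(attn)), key=attn.__getitem__)
predicted_segment = proposals[best]
```

In a weakly-supervised setup, no segment labels supervise `attn` directly; the weights would have to be learned indirectly, for example through a loss on how well the pooled feature supports reconstructing the query.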

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Convolution