WSLLN: Weakly Supervised Natural Language Localization Networks

Mingfei Gao; Larry S. Davis; Richard Socher; Caiming Xiong

arXiv:1909.00239 · cs.CV · September 4, 2019 · 5 cites

Open Access

TL;DR

WSLLN introduces a weakly supervised approach for language-based event localization in videos, significantly reducing annotation costs by learning from video-sentence pairs without needing explicit temporal annotations.

Contribution

The paper presents WSLLN, an end-to-end network that localizes events in videos from language queries using only video-sentence pairs as supervision, and reports state-of-the-art performance on ActivityNet Captions and DiDeMo.

Findings

01 Achieves state-of-the-art results on ActivityNet Captions and DiDeMo

02 Reduces annotation effort by eliminating the need for temporal labels

03 Demonstrates effective segment-text matching in a weakly supervised setting

Abstract

We propose weakly supervised language localization networks (WSLLN) to detect events in long, untrimmed videos given language queries. To learn the correspondence between visual segments and texts, most previous methods require temporal coordinates (start and end times) of events for training, which leads to high annotation costs. WSLLN relieves the annotation burden by training with only video-sentence pairs, without access to the temporal locations of events. With a simple end-to-end structure, WSLLN measures segment-text consistency and conducts segment selection (conditioned on the text) simultaneously. Results from both are merged and optimized as a video-sentence matching problem. Experiments on ActivityNet Captions and DiDeMo demonstrate that WSLLN achieves state-of-the-art performance.
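The two-branch idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the element-wise fusion of segment and sentence features, the sigmoid for consistency, the softmax for selection, the product used to merge them, the binary cross-entropy loss, and all names (`video_sentence_score`, `w_align`, `w_select`, `matching_loss`) are assumptions chosen only to show how per-segment scores could be trained from a video-level match label.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def video_sentence_score(segment_feats, text_feat, w_align, w_select):
    """Hypothetical two-branch scoring of candidate segments.

    segment_feats: (n_segments, d) features of candidate segments
    text_feat:     (d,) feature of the query sentence
    w_align, w_select: (d,) illustrative linear weights, one per branch
    """
    # Assumption: fuse each segment with the sentence by element-wise product.
    fused = segment_feats * text_feat
    # Branch 1: how consistent each segment is with the text, in (0, 1).
    consistency = sigmoid(fused @ w_align)
    # Branch 2: text-conditioned selection, competing across segments.
    selection = softmax(fused @ w_select)
    # Merge both branches, then pool to one video-level matching score.
    merged = consistency * selection
    return merged, float(merged.sum())

def matching_loss(video_score, is_match):
    # Binary cross-entropy on the video-level score (assumed loss).
    p = min(max(video_score, 1e-7), 1.0 - 1e-7)
    return float(-np.log(p) if is_match else -np.log(1.0 - p))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    segs = rng.normal(size=(5, 8))   # 5 candidate segments
    query = rng.normal(size=8)       # one sentence
    merged, score = video_sentence_score(
        segs, query, rng.normal(size=8), rng.normal(size=8)
    )
    print("predicted segment index:", int(np.argmax(merged)))
    print("loss for a matching pair:", matching_loss(score, True))
```

In this sketch only the video-level label (`is_match`) enters the loss, yet at inference the merged per-segment scores can rank candidate segments, which is the sense in which no temporal annotation is needed.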

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization