Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video

Zhenfang Chen; Lin Ma; Wenhan Luo; Peng Tang; Kwan-Yee K. Wong

arXiv:2001.09308·cs.CV·January 28, 2020·48 cites

TL;DR

This paper introduces a two-stage, coarse-to-fine weakly-supervised model that localizes sentence descriptions in untrimmed videos without temporal annotations during training, and reports better results than existing weakly-supervised methods on ActivityNet Captions and Charades-STA.

Contribution

It proposes a novel coarse-to-fine approach for temporal grounding of sentences in videos under weak supervision, eliminating the need for temporal annotations during training.

Findings

01. Effective in localizing sentence segments in videos

02. Outperforms existing weakly-supervised methods

03. Validated on ActivityNet Captions and Charades-STA datasets

Abstract

In this paper, we study the problem of weakly-supervised temporal grounding of sentence in video. Specifically, given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence, with no reliance on any temporal annotation during training. We propose a two-stage model to tackle this problem in a coarse-to-fine manner. In the coarse stage, we first generate a set of fixed-length temporal proposals using multi-scale sliding windows, and match their visual features against the sentence features to identify the best-matched proposal as a coarse grounding result. In the fine stage, we perform a fine-grained matching between the visual features of the frames in the best-matched proposal and the sentence features to locate the precise frame boundary of the fine grounding result. Comprehensive experiments…
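The coarse-to-fine procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the matching function, the pooling, or how boundaries are chosen, so cosine similarity, mean pooling, the 50% window stride, and the score threshold below are all assumptions made for illustration, and every function name is hypothetical. Features are plain lists of floats, one per frame, plus one sentence vector.

```python
import math

def sliding_window_proposals(num_frames, window_sizes, stride_ratio=0.5):
    """Multi-scale sliding windows as (start, end) frame indices, end exclusive.
    The stride ratio is an assumption, not a value taken from the paper."""
    proposals = []
    for size in window_sizes:
        if size > num_frames:
            continue  # a window longer than the video yields no proposal
        stride = max(1, int(size * stride_ratio))
        for start in range(0, num_frames - size + 1, stride):
            proposals.append((start, start + size))
    return proposals

def cosine(u, v):
    """Cosine similarity, used here as a stand-in for a learned matching score."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_pool(frames):
    """Average the frame features of a proposal into one vector (assumed pooling)."""
    return [sum(column) / len(frames) for column in zip(*frames)]

def coarse_stage(frame_feats, sent_feat, window_sizes):
    """Return the proposal whose pooled visual feature best matches the sentence."""
    proposals = sliding_window_proposals(len(frame_feats), window_sizes)
    return max(
        proposals,
        key=lambda p: cosine(mean_pool(frame_feats[p[0]:p[1]]), sent_feat),
    )

def fine_stage(frame_feats, sent_feat, proposal, threshold=0.5):
    """Score each frame inside the coarse proposal and keep the span from the
    first to the last frame whose score reaches the (assumed) threshold."""
    start, end = proposal
    kept = [
        t for t in range(start, end)
        if cosine(frame_feats[t], sent_feat) >= threshold
    ]
    if not kept:
        return proposal  # fall back to the coarse result
    return (kept[0], kept[-1] + 1)
```

On a toy 8-frame video in which only frames 2 to 4 align with the sentence vector, calling `coarse_stage` with window sizes 4 and 8 selects the window (2, 6), and `fine_stage` trims it to (2, 5).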

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition