Weakly Supervised Dense Video Captioning

Zhiqiang Shen, Jianguo Li, Zhou Su, Minjun Li, Yurong Chen, Yu-Gang Jiang, and Xiangyang Xue

arXiv:1704.01502 · cs.CV · April 6, 2017 · 27 citations

TL;DR

This paper introduces a weakly supervised approach to dense video captioning that generates multiple informative and diverse captions per video clip using only video-level sentence annotations, and it outperforms state-of-the-art single captioning methods.

Contribution

It proposes a novel weakly supervised framework combining Lexical-FCN, submodular maximization, and sequence-to-sequence models for dense video captioning.

Findings

01. Outperforms state-of-the-art single captioning methods

02. Produces diverse and informative dense captions

03. Effectively links video regions with lexical labels using weak supervision
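
The weak linking of regions to lexical labels in the third finding can be pictured with a small multi-instance multi-label sketch. This is an illustration under stated assumptions, not the paper's implementation: it assumes each video is a bag of regions, that per-region label probabilities are max-pooled into a video-level probability (one common aggregation choice), and that the loss is a binary cross-entropy against words taken from the video-level sentences. The names `bag_probabilities`, `miml_loss`, and the toy data are hypothetical.

```python
import math

def bag_probabilities(region_scores):
    """Max-pool per-region label probabilities into video-level probabilities.

    region_scores: list of regions, each a list of per-label probabilities.
    Max pooling is an assumed aggregation, chosen here for illustration.
    """
    num_labels = len(region_scores[0])
    return [max(region[k] for region in region_scores) for k in range(num_labels)]

def miml_loss(region_scores, video_labels, eps=1e-7):
    """Mean binary cross-entropy between pooled probabilities and video-level labels."""
    probs = bag_probabilities(region_scores)
    loss = 0.0
    for p, y in zip(probs, video_labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss / len(probs)

# Hypothetical toy input: 3 regions scored against 4 lexical labels.
regions = [
    [0.9, 0.2, 0.1, 0.1],
    [0.3, 0.8, 0.1, 0.2],
    [0.1, 0.1, 0.2, 0.1],
]
labels = [1, 1, 0, 0]  # which labels occur in the video-level sentences
pooled = bag_probabilities(regions)
loss = miml_loss(regions, labels)
```

Only video-level labels enter the loss, yet the gradient of a max-pooled probability flows to the region that scored highest for each label, which is how region-level evidence can emerge without region-level annotation.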

Abstract

This paper focuses on a novel and challenging vision task, dense video captioning, which aims to automatically describe a video clip with multiple informative and diverse caption sentences. The proposed method is trained without explicit annotation of fine-grained sentence to video region-sequence correspondence, but is only based on weak video-level sentence annotations. It differs from existing video captioning systems in three technical aspects. First, we propose lexical fully convolutional neural networks (Lexical-FCN) with weakly supervised multi-instance multi-label learning to weakly link video regions with lexical labels. Second, we introduce a novel submodular maximization scheme to generate multiple informative and diverse region-sequences based on the Lexical-FCN outputs. A winner-takes-all scheme is adopted to weakly associate sentences to region-sequences in the training…
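
The submodular maximization step mentioned in the abstract can be illustrated with a generic greedy selection. The sketch below is not the paper's objective: it assumes each candidate region-sequence is summarized by a vector of lexical-label scores and uses a facility-location coverage function (a standard monotone submodular function) as a stand-in that rewards both high scores and coverage of different labels. The names `coverage`, `greedy_select`, and the toy scores are hypothetical.

```python
def coverage(selected, scores):
    """Facility-location objective: best score per label over the selected items."""
    if not selected:
        return 0.0
    num_labels = len(scores[0])
    return sum(max(scores[i][k] for i in selected) for k in range(num_labels))

def greedy_select(scores, budget):
    """Greedily pick up to `budget` items, taking the largest marginal gain each time."""
    selected = []
    remaining = set(range(len(scores)))
    while remaining and len(selected) < budget:
        current = coverage(selected, scores)
        gains = {i: coverage(selected + [i], scores) - current for i in remaining}
        # Ties are broken in favor of the lowest index.
        best = max(sorted(gains), key=gains.get)
        if gains[best] <= 0:
            break  # nothing left adds value
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical label scores for 4 candidate region-sequences over 3 labels.
candidates = [
    [0.9, 0.1, 0.0],
    [0.8, 0.1, 0.0],  # near-duplicate of candidate 0
    [0.1, 0.7, 0.1],
    [0.0, 0.1, 0.6],
]
picked = greedy_select(candidates, budget=3)
# picks [0, 2, 3]: candidate 1 adds nothing once candidate 0 is chosen
```

For monotone submodular objectives under a cardinality budget, this greedy rule carries the well-known (1 - 1/e) approximation guarantee, which is why it is a natural fit for picking a small, diverse set of region-sequences.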

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition