Weakly Supervised Dense Video Captioning
Zhiqiang Shen, Jianguo Li, Zhou Su, Minjun Li, Yurong Chen, and Yu-Gang Jiang, Xiangyang Xue

TL;DR
This paper introduces a weakly supervised dense video captioning approach that generates multiple diverse captions without requiring detailed annotations, outperforming existing methods.
Contribution
It proposes a novel weakly supervised framework combining Lexical-FCN, submodular maximization, and sequence-to-sequence models for dense video captioning.
Findings
Outperforms state-of-the-art single captioning methods
Produces diverse and informative dense captions
Effectively links video regions with lexical labels using weak supervision
Abstract
This paper focuses on a novel and challenging vision task, dense video captioning, which aims to automatically describe a video clip with multiple informative and diverse caption sentences. The proposed method is trained without explicit annotation of fine-grained sentence to video region-sequence correspondence, but is only based on weak video-level sentence annotations. It differs from existing video captioning systems in three technical aspects. First, we propose lexical fully convolutional neural networks (Lexical-FCN) with weakly supervised multi-instance multi-label learning to weakly link video regions with lexical labels. Second, we introduce a novel submodular maximization scheme to generate multiple informative and diverse region-sequences based on the Lexical-FCN outputs. A winner-takes-all scheme is adopted to weakly associate sentences to region-sequences in the training…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
