Towards Diverse Paragraph Captioning for Untrimmed Videos

Yuqing Song; Shizhe Chen; Qin Jin

arXiv:2105.14477·cs.CV·June 1, 2021

TL;DR

This paper introduces a novel video paragraph captioning model that directly generates descriptive paragraphs for untrimmed videos, avoiding the unreliable event detection step and enhancing diversity and efficiency through dynamic attention and keyframe awareness.

Contribution

The proposed model eliminates the need for event detection, uses dynamic video memories for better coherence and diversity, and incorporates keyframe awareness for efficiency in untrimmed videos.

Findings

1. Outperforms prior state-of-the-art methods on the ActivityNet and Charades datasets.

2. Achieves higher accuracy and diversity in paragraph generation than those methods.

3. Does not require event boundary annotations.

Abstract

Video paragraph captioning aims to describe multiple events in untrimmed videos with descriptive paragraphs. Existing approaches mainly solve the problem in two steps: event detection and then event captioning. Such a two-step approach makes the quality of generated paragraphs highly dependent on the accuracy of event proposal detection, which is itself a challenging task. In this paper, we propose a paragraph captioning model which eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos. To describe coherent and diverse events, we propose to enhance the conventional temporal attention with dynamic video memories, which progressively expose new video features and suppress over-accessed video contents to control the visual focus of the model. In addition, a diversity-driven training strategy is proposed to improve the diversity of paragraphs on the…
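
To make the idea of attention with dynamic video memories more concrete, here is a minimal sketch of one decoding step. It is not the authors' implementation: the function names (`softmax`, `attend_step`), the `visible` window, the `accessed` counter, and the linear `penalty` term are all hypothetical choices made for illustration. It only assumes what the abstract states, namely that new video features are exposed progressively and that over-accessed content is suppressed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats (-inf entries get weight 0)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_step(query, features, accessed, visible, penalty=2.0):
    """One illustrative attention step over frame features.

    query    : decoder state, list of floats
    features : per-frame feature vectors, list of lists of floats
    accessed : cumulative attention each frame has received so far (assumption)
    visible  : number of leading frames exposed at this step, must be >= 1 (assumption)
    penalty  : strength of the suppression of over-accessed frames (assumption)
    """
    scores = []
    for i, (feat, acc) in enumerate(zip(features, accessed)):
        if i >= visible:
            # Frames not yet exposed are masked out entirely.
            scores.append(float("-inf"))
        else:
            dot = sum(q * f for q, f in zip(query, feat))
            # Frames that were attended to often get a lower score.
            scores.append(dot - penalty * acc)
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)]
    new_accessed = [a + w for a, w in zip(accessed, weights)]
    return context, weights, new_accessed
```

Calling `attend_step` repeatedly with the same query and feeding `new_accessed` back in as `accessed` shifts attention away from frames that have already been described, while increasing `visible` over time lets later frames enter the attention window.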

Code & Models

Repositories

syuqings/video-paragraph (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization