Towards Diverse Paragraph Captioning for Untrimmed Videos
Yuqing Song, Shizhe Chen, Qin Jin

TL;DR
This paper introduces a novel video paragraph captioning model that directly generates descriptive paragraphs for untrimmed videos, avoiding the unreliable event detection step and enhancing diversity and efficiency through dynamic attention and keyframe awareness.
Contribution
The proposed model eliminates the need for event detection, uses dynamic video memories for better coherence and diversity, and incorporates keyframe awareness for efficiency in untrimmed videos.
Findings
Outperforms state-of-the-art methods on the ActivityNet and Charades datasets.
Achieves higher accuracy and diversity in paragraph generation.
Does not require event boundary annotations.
Abstract
Video paragraph captioning aims to describe multiple events in untrimmed videos with descriptive paragraphs. Existing approaches mainly solve the problem in two steps: event detection and then event captioning. Such a two-step manner makes the quality of generated paragraphs highly dependent on the accuracy of event proposal detection, which is already a challenging task. In this paper, we propose a paragraph captioning model which eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos. To describe coherent and diverse events, we propose to enhance the conventional temporal attention with dynamic video memories, which progressively expose new video features and suppress over-accessed video contents to control the visual focus of the model. In addition, a diversity-driven training strategy is proposed to improve diversity of paragraph on the…
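The idea of suppressing over-accessed video contents can be illustrated with a small sketch. This is not the authors' formulation: it assumes a simple coverage-style penalty, where each frame's attention score is reduced in proportion to the attention it has already received. The function name attend_with_memory, the penalty parameter, and the toy features are all hypothetical, and the sketch covers only the suppression half of the description, not the progressive exposure of new features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_memory(query, frames, access, penalty=2.0):
    """One temporal-attention step with an access-history penalty (assumed form).

    query:  decoder state as a list of floats (hypothetical)
    frames: one feature vector (list of floats) per video frame
    access: accumulated attention each frame has received so far
    """
    # Dot-product relevance, lowered for frames that were already attended to.
    scores = [
        sum(q * f for q, f in zip(query, frame)) - penalty * a
        for frame, a in zip(frames, access)
    ]
    weights = softmax(scores)
    # Attention-weighted average of the frame features.
    dim = len(frames[0])
    context = [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
    # Update the access history so later steps see what was used.
    new_access = [a + w for a, w in zip(access, weights)]
    return context, weights, new_access


# Toy usage: the same query is applied twice, so only the memory changes.
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]
access = [0.0, 0.0, 0.0]
_, w1, access = attend_with_memory(query, frames, access)
_, w2, access = attend_with_memory(query, frames, access)
print("step 1 weights:", [round(w, 3) for w in w1])
print("step 2 weights:", [round(w, 3) for w in w2])
```

With an unchanged query, the weight on the most-attended frame drops at the second step and the remaining frames gain weight, which is the kind of shift in visual focus the abstract describes. A trained model would presumably learn how the memory is updated rather than rely on a fixed penalty.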
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
