Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization

Zhiwang Zhang; Dong Xu; Wanli Ouyang; Chuanqi Tan

arXiv:2506.20567 · cs.CV · June 26, 2025

Open Access

TL;DR

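This paper introduces a dense video captioning framework that partitions each untrimmed video into event proposals made up of short segments, generates one sentence per segment, and then summarizes those sentences into a single descriptive sentence using a two-stage LSTM with a hierarchical attention mechanism, aided by visual features.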

Contribution

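The work proposes a division-and-summarization (DaS) framework that recasts dense video captioning as a visual cue aided sentence summarization problem, together with a two-stage LSTM equipped with a hierarchical attention mechanism that summarizes the per-segment sentences of each event proposal.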

Findings

01 Effective dense captioning demonstrated on the ActivityNet dataset

02 Two-stage LSTM with hierarchical attention and visual cues outperforms existing methods

03 Summarization achieves more coherent and comprehensive descriptions

Abstract

In this work, we propose a division-and-summarization (DaS) framework for dense video captioning. After partitioning each untrimmed long video into multiple event proposals, where each event proposal consists of a set of short video segments, we extract visual features (e.g., C3D features) from each segment and use an existing image/video captioning approach to generate a one-sentence description for this segment. Considering that the generated sentences contain rich semantic descriptions of the whole event proposal, we formulate the dense video captioning task as a visual cue aided sentence summarization problem and propose a new two-stage Long Short-Term Memory (LSTM) approach, equipped with a new hierarchical attention mechanism, to summarize all generated sentences into one descriptive sentence with the aid of visual features. Specifically, the first-stage LSTM network takes all semantic…
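The abstract is cut off before it details the attention mechanism, so the following is only a minimal sketch of the general idea of attending at two levels (words within each generated sentence, then sentences within an event proposal) while mixing in per-segment visual features. Everything here is an assumption made for illustration: the names `Segment`, `attend`, and `summarize_context`, the dot-product scoring, and the additive fusion of sentence and visual vectors are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(query: Vector, items: List[Vector]) -> Tuple[Vector, List[float]]:
    """Dot-product attention (an assumed scoring rule): weight items by
    similarity to the query and return their weighted average."""
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    pooled = [sum(w * item[i] for w, item in zip(weights, items)) for i in range(dim)]
    return pooled, weights


@dataclass
class Segment:
    """One short video segment of an event proposal (hypothetical container)."""
    visual_feature: Vector       # stands in for e.g. a C3D feature
    word_vectors: List[Vector]   # embeddings of the words in the generated sentence


def summarize_context(query: Vector, segments: List[Segment]) -> Tuple[Vector, List[float]]:
    """Two-level attention: pool words into one vector per sentence, fuse it
    with the segment's visual feature, then pool across sentences."""
    fused = []
    for seg in segments:
        sentence_vec, _ = attend(query, seg.word_vectors)        # word level
        # Additive fusion is an illustrative assumption, not the paper's rule.
        fused.append([s + v for s, v in zip(sentence_vec, seg.visual_feature)])
    return attend(query, fused)                                   # sentence level


# Toy event proposal with two segments and 2-dimensional features.
segments = [
    Segment(visual_feature=[1.0, 0.0], word_vectors=[[1.0, 0.0], [1.0, 0.0]]),
    Segment(visual_feature=[0.0, 1.0], word_vectors=[[0.0, 1.0], [0.0, 1.0]]),
]
# The query points along the second axis, so the second segment gets more weight.
context, sentence_weights = summarize_context([0.0, 2.0], segments)
```

In a trained model the query would plausibly come from the hidden state of the summarizing LSTM at each decoding step, and the scoring and fusion functions would be learned rather than fixed as they are in this toy version.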

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition

Methods: Long Short-Term Memory · Sparse Evolutionary Training