Dense Video Captioning using Graph-based Sentence Summarization

Zhiwang Zhang; Dong Xu; Wanli Ouyang; Luping Zhou

arXiv:2506.20583·cs.CV·June 26, 2025

TL;DR

This paper introduces a graph-based partition-and-summarization framework for dense video captioning: each event proposal is split into short segments that are captioned individually, and the segment-level sentences are then summarized into a single event description by exploiting semantic relationships between words.

Contribution

The paper proposes the graph-based partition-and-summarization (GPaS) framework for dense video captioning, with a GCN-LSTM interaction module intended to model scene evolution better; the work focuses on the summarization stage.

Findings

01 Outperforms state-of-the-art on ActivityNet Captions dataset

02 Effective in capturing scene evolution over long proposals

03 Improves caption quality by exploiting semantic word relationships

Abstract

Recently, dense video captioning has made notable progress in detecting and captioning all events in a long untrimmed video. Although promising results have been achieved, most existing methods do not sufficiently explore the scene evolution within an event temporal proposal for captioning, and therefore perform less satisfactorily when the scenes and objects change over a relatively long proposal. To address this problem, we propose a graph-based partition-and-summarization (GPaS) framework for dense video captioning that works in two stages. In the "partition" stage, a whole event proposal is split into short video segments for captioning at a finer level. In the "summarization" stage, the generated sentences, which carry rich descriptive information for each segment, are summarized into one sentence that describes the whole event. We particularly focus on the "summarization" stage, and propose a…
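The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fixed-length splitting rule, the function names `partition_proposal` and `caption_event`, and the `caption_segment` and `summarize` callables are all assumptions made for illustration, and the graph-based summarization model itself is left as a pluggable placeholder.

```python
import math
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_time, end_time) in seconds


def partition_proposal(start: float, end: float, segment_len: float) -> List[Segment]:
    """Split an event proposal [start, end] into consecutive short segments.

    Assumption: fixed-length segments, with a shorter final segment if needed.
    The paper may use a different partitioning rule.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    if end <= start:
        raise ValueError("end must be greater than start")
    count = math.ceil((end - start) / segment_len)
    return [
        (start + i * segment_len, min(start + (i + 1) * segment_len, end))
        for i in range(count)
    ]


def caption_event(
    start: float,
    end: float,
    segment_len: float,
    caption_segment: Callable[[Segment], str],  # hypothetical segment captioner
    summarize: Callable[[List[str]], str],      # hypothetical sentence summarizer
) -> str:
    """Partition stage followed by summarization stage."""
    segments = partition_proposal(start, end, segment_len)
    sentences = [caption_segment(seg) for seg in segments]  # one sentence per segment
    return summarize(sentences)  # one sentence for the whole event
```

For example, `partition_proposal(0.0, 10.0, 4.0)` yields three segments, the last one shorter than the others. Passing the captioner and summarizer as callables keeps the sketch independent of any particular model.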

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Long Short-Term Memory · Graph Convolutional Network · Focus