Cap2Sum: Learning to Summarize Videos by Generating Captions

Cairong Zhao; Chutian Wang; Zifan Song; Guosheng Hu; Haonan Chen; Xiaofan Zhai

arXiv:2408.12800·cs.MM·January 13, 2026

TL;DR

Cap2Sum is a weakly-supervised approach that learns to summarize videos by generating captions. It trains on large-scale dense video caption datasets and uses a CLIP prior to improve performance and generalization, supporting both zero-shot and fine-tuned summarization.

Contribution

The paper proposes Cap2Sum, a model that uses dense caption generation as the supervision signal for video summarization and incorporates a CLIP prior to enhance the learning of important objects, which improves generalization and enables zero-shot summarization.

Findings

1. Significant performance improvements over previous methods.

2. Effective zero-shot video summarization demonstrated.

3. Introduction of two new datasets, TVSum-Caption and SumMe-Caption.

Abstract

With the rapid growth of video data on the internet, video summarization is becoming a very important AI technology. However, due to the high labelling cost of video summarization, existing studies have to be conducted on small-scale datasets, leading to limited performance and generalization capacity. In this work, we introduce the use of dense video captions as a supervision signal to train video summarization models. Motivated by this, we propose Cap2Sum, a model that learns to summarize videos by generating captions, to exploit dense video caption annotations. This weakly-supervised approach allows us to train the models on large-scale dense video caption datasets to achieve better performance and generalization capacity. To further improve the generalization capacity, we introduce a CLIP (a strong vision-language model) Prior mechanism to enhance the learning of important objects…

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Natural Language Processing Techniques