Cap2Sum: Learning to Summarize Videos by Generating Captions
Cairong Zhao, Chutian Wang, Zifan Song, Guosheng Hu, Haonan Chen, Xiaofan Zhai

TL;DR
Cap2Sum is a weakly supervised approach that learns to summarize videos by generating captions. Training on large-scale dense caption datasets with a CLIP prior improves performance and generalization and enables both zero-shot and fine-tuned summarization.
Contribution
The paper proposes Cap2Sum, a model that uses dense caption generation as the supervision signal for video summarization and incorporates a CLIP prior to enhance the learning of important objects, improving generalization and enabling zero-shot summarization.
Findings
Significant performance improvements over previous methods.
Effective zero-shot video summarization demonstrated.
Introduction of new datasets TVSum-Caption and SumMe-Caption.
Abstract
With the rapid growth of video data on the internet, video summarization is becoming a very important AI technology. However, due to the high labelling cost of video summarization, existing studies have to be conducted on small-scale datasets, leading to limited performance and generalization capacity. In this work, we introduce the use of dense video captions as a supervision signal to train video summarization models. Motivated by this, we propose Cap2Sum, a model that learns to summarize videos by generating captions, to exploit dense video caption annotations. This weakly-supervised approach allows us to train the models on large-scale dense video caption datasets to achieve better performance and generalization capacity. To further improve the generalization capacity, we introduce a CLIP (a strong vision-language model) Prior mechanism to enhance the learning of important objects…
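The abstract does not say how a summary is read out of the trained model. As a rough illustration only, the sketch below assumes the model yields one importance score per frame and keeps the highest-scoring frames within a length budget. The function name `select_summary_frames`, the `budget` parameter, and the example scores are hypothetical and are not taken from the paper.

```python
from typing import List


def select_summary_frames(scores: List[float], budget: float = 0.15) -> List[int]:
    """Return the indices of the highest-scoring frames, in temporal order.

    Assumption (not from the paper): `scores` holds one importance value per
    frame, and `budget` is the fraction of frames to keep in the summary.
    """
    if not 0.0 < budget <= 1.0:
        raise ValueError("budget must be in (0, 1]")
    if not scores:
        return []
    # Keep at least one frame, even for very short videos.
    k = max(1, int(len(scores) * budget))
    # Rank frame indices by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order so the summary plays chronologically.
    return sorted(ranked[:k])


# Hypothetical per-frame scores for a 10-frame clip.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.7, 0.4, 0.6, 0.15]
print(select_summary_frames(scores, budget=0.5))
```

Frame-level top-k selection is the simplest possible readout; shot-level selection under a duration budget is a common alternative and may be closer to what summarization benchmarks evaluate.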
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Natural Language Processing Techniques
