Learning to Summarize Videos by Contrasting Clips
Ivan Sosnovik, Artem Moskalev, Cees Kaandorp, Arnold Smeulders

TL;DR
This paper introduces Contrastive video Summarization (CSUM), an unsupervised method that learns to produce diverse summaries without labeled data. It contrasts top-k features rather than a single mean video feature.
Contribution
It proposes a novel contrastive learning framework that contrasts top-k features for unsupervised video summarization, improving diversity and informativeness of summaries.
Findings
Achieves meaningful summaries without labeled data
Outperforms existing methods on benchmark datasets
Enhances diversity of video summaries
Abstract
Video summarization aims at choosing parts of a video that narrate a story as close as possible to the original one. Most of the existing video summarization approaches focus on hand-crafted labels. As the number of videos grows exponentially, there emerges an increasing need for methods that can learn meaningful summarizations without labeled annotations. In this paper, we aim to maximally exploit unsupervised video summarization while concentrating the supervision to a few, personalized labels as an add-on. To do so, we formulate the key requirements for the informative video summarization. Then, we propose contrastive learning as the answer to both questions. To further boost Contrastive video Summarization (CSUM), we propose to contrast top-k features instead of a mean video feature as employed by the existing method, which we implement with a differentiable top-k feature selector.…
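The idea of contrasting top-k features rather than one mean feature can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the InfoNCE-style loss, the use of a mean video feature as the positive, and the toy data are all assumptions made for illustration. The selector here is a plain hard top-k, which is not differentiable; the differentiable selector mentioned in the abstract is not reproduced.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def mean_feature(features):
    # Average all clip features of one video into a single vector.
    dim = len(features[0])
    return [sum(f[j] for f in features) / len(features) for j in range(dim)]

def top_k_features(features, scores, k):
    # Hard top-k (assumption): keep the k clip features with the highest scores.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [features[i] for i in order]

def info_nce(anchor, positive, negatives, temperature=0.1):
    # InfoNCE-style loss on cosine similarities (assumed form of the contrast).
    a = normalize(anchor)
    logits = [dot(a, normalize(c)) / temperature for c in [positive] + list(negatives)]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

def top_k_contrastive_loss(features, scores, k, positive, negatives):
    # Average the loss over each selected feature instead of over one mean feature.
    selected = top_k_features(features, scores, k)
    return sum(info_nce(f, positive, negatives) for f in selected) / len(selected)

# Toy data (hypothetical): two videos with 2-D clip features.
video_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
scores_a = [0.9, 0.8, 0.1]
video_b = [[0.0, 1.0], [0.1, 0.9]]

aligned = top_k_contrastive_loss(
    video_a, scores_a, 2, mean_feature(video_a), [mean_feature(video_b)])
swapped = top_k_contrastive_loss(
    video_a, scores_a, 2, mean_feature(video_b), [mean_feature(video_a)])
```

In this toy setup, `aligned` pairs the selected clips with their own video and `swapped` pairs them with the other video, so `aligned` comes out smaller. Training with such a loss would require replacing the hard selection with a differentiable relaxation.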
Taxonomy
Topics: Video Analysis and Summarization · Music and Audio Processing · Human Motion and Animation
Methods: Contrastive Learning
