Language-Guided Self-Supervised Video Summarization Using Text Semantic Matching Considering the Diversity of the Video

Tomoya Sugihara; Shuntaro Masuda; Ling Xiao; Toshihiko Yamasaki

arXiv:2405.08890·cs.CV·August 21, 2024

TL;DR

This paper introduces a self-supervised video summarization approach that leverages large language models to generate and compare captions, enabling diversity-aware and personalized video summaries without manual annotations.

Contribution

It recasts video summarization as an NLP task using LLMs, introduces a novel diversity-aware loss function, and reports state-of-the-art results on the SumMe dataset.

Findings

01 State-of-the-art performance on the SumMe dataset

02 Effective diversity-aware video summarization

03 Personalized summarization capability

Abstract

Current video summarization methods rely heavily on supervised computer vision techniques, which demand time-consuming and subjective manual annotations. To overcome these limitations, we investigated self-supervised video summarization. Inspired by the success of Large Language Models (LLMs), we explored the feasibility of transforming the video summarization task into a Natural Language Processing (NLP) task. By leveraging the strengths of LLMs in context understanding, we aim to enhance the effectiveness of self-supervised video summarization. Our method begins by generating captions for individual video frames, which LLMs then synthesize into a text summary. Subsequently, we measure the semantic distance between each caption and the text summary. Notably, we propose a novel loss function that optimizes our model according to the diversity of the video. Finally, the summarized…
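
The caption-to-summary matching step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the frame captions and the LLM-written text summary are already available as strings, and it substitutes a plain bag-of-words cosine similarity for whatever learned text representation the paper uses. The function names (`bag_of_words`, `score_frames`, `select_top_frames`) and the example captions are hypothetical, and the diversity-aware loss is omitted because the abstract does not specify it.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_frames(captions, summary):
    """Score each frame caption by its similarity to the text summary.

    Assumption: a higher similarity (smaller semantic distance) marks a
    frame as more relevant to the summary.
    """
    summary_vec = bag_of_words(summary)
    return [cosine_similarity(bag_of_words(c), summary_vec) for c in captions]


def select_top_frames(scores, k):
    """Return the indices of the k best-scoring frames in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical captions for three frames and a hypothetical text summary.
captions = [
    "a dog runs on the beach",
    "a close-up of sand",
    "a dog catches a frisbee",
]
summary = "a dog plays with a frisbee on the beach"

scores = score_frames(captions, summary)
keyframes = select_top_frames(scores, k=2)
print(keyframes)
```

In this toy example the two captions that mention the dog score higher than the close-up of sand, so they are kept. A real system would replace `bag_of_words` with a sentence-level text encoder, since token overlap misses paraphrases.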


Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Computational and Text Analysis Methods