Personalized Video Summarization by Multimodal Video Understanding

Brian Chen; Xiangyuan Zhao; Yingnan Zhu

arXiv:2411.03531·cs.CV·November 7, 2024

TL;DR

This paper introduces a new benchmark and a multimodal pipeline for personalized video summarization that leverages pre-trained visual language models to adapt to user preferences without extensive training.

Contribution

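The authors propose a benchmark that captures varied user preferences and a pipeline, Video Summarization with Language (VSL), that performs personalized summarization through semantic analysis of video and captions. Because VSL builds on pre-trained visual language models, it does not require a large training dataset.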

Findings

01. Outperforms state-of-the-art unsupervised models

02. More adaptable across datasets than supervised models

03. Efficient runtime suitable for scaling

Abstract

Video summarization techniques have been shown to improve the overall user experience of accessing and comprehending video content. If the user's preference is known, video summarization can identify significant information or relevant content in an input video, helping the user obtain the information they need or decide whether to watch the original video. Adapting video summarization to various types of video and user preferences requires significant training data and expensive human labeling. To facilitate such research, we propose a new benchmark for video summarization that captures various user preferences. We also present a pipeline called Video Summarization with Language (VSL) for user-preferred video summarization that is based on pre-trained visual language models (VLMs) to avoid the need to train a video summarization system on a large…
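To make the idea of preference-driven selection concrete, here is a minimal sketch of how captioned video segments might be ranked against a user's stated interest. It is not the authors' VSL pipeline: the segment format, the `summarize` function, the duration budget, and the bag-of-words similarity (a toy stand-in for embeddings from a pre-trained VLM) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase token counts; a toy stand-in for a VLM text embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(segments, preference, budget):
    """Select preference-relevant segments whose total length fits the budget.

    segments: list of (start_sec, end_sec, caption) tuples (hypothetical format).
    preference: free-text description of what the user cares about.
    budget: maximum total summary length in seconds.
    """
    pref_vec = bag_of_words(preference)
    scored = [
        (cosine(bag_of_words(caption), pref_vec), start, end, caption)
        for start, end, caption in segments
    ]
    # Highest similarity first; earlier segments win ties.
    scored.sort(key=lambda s: (-s[0], s[1]))

    chosen, used = [], 0.0
    for score, start, end, caption in scored:
        length = end - start
        # Skip segments with no overlap with the preference text.
        if score > 0 and used + length <= budget:
            chosen.append((start, end, caption))
            used += length
    # Return the summary in chronological order.
    return sorted(chosen)


segments = [
    (0, 10, "a chef chops onions in the kitchen"),
    (10, 20, "a dog runs across the park"),
    (20, 30, "the chef plates pasta and serves dinner"),
    (30, 40, "city traffic at night"),
]
summary = summarize(segments, "cooking chef kitchen food", budget=20)
```

With this toy input, `summary` contains the two cooking-related segments in chronological order. In a real system the captions would come from a captioning model and the similarity from learned embeddings; the greedy budget fill is only one of several plausible selection strategies.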
