Personalized Video Summarization by Multimodal Video Understanding
Brian Chen, Xiangyuan Zhao, Yingnan Zhu

TL;DR
This paper introduces a new benchmark and a multimodal pipeline for personalized video summarization that leverages pre-trained visual language models to adapt to user preferences without extensive training.
Contribution
The authors propose a pipeline called Video Summarization with Language (VSL) that produces personalized summaries from semantic analysis of a video and its captions, relying on pre-trained models rather than a large training dataset.
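To make the idea of caption-based personalization concrete, here is a minimal sketch. It is not the authors' implementation: it assumes the video has already been split into captioned segments, and it swaps the pre-trained visual language model for a toy bag-of-words cosine similarity. The names `bag_of_words`, `cosine`, and `summarize`, the segment format, and the time budget are all hypothetical, introduced only for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a toy stand-in for a VLM text embedding (assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(segments, preference, budget_seconds):
    """Greedily keep the segments most similar to the preference text.

    segments: list of (start_sec, end_sec, caption) tuples (hypothetical format).
    Segments with zero similarity are never selected.
    Returns the chosen segments in chronological order.
    """
    pref_vec = bag_of_words(preference)
    scored = [(cosine(bag_of_words(cap), pref_vec), start, end, cap)
              for start, end, cap in segments]
    # Highest score first; ties keep their original (chronological) order.
    scored.sort(key=lambda item: item[0], reverse=True)

    chosen, used = [], 0.0
    for score, start, end, cap in scored:
        duration = end - start
        if score > 0 and used + duration <= budget_seconds:
            chosen.append((start, end, cap))
            used += duration
    return sorted(chosen)


# Made-up example data, not taken from the paper's benchmark.
segments = [
    (0, 10, "two men talk in an office"),
    (10, 20, "a car chase through the city streets"),
    (20, 30, "a couple has dinner at a restaurant"),
    (30, 40, "an explosion and a fight on the rooftop"),
]
summary = summarize(segments, "action: fight, chase, explosion", budget_seconds=20)
for start, end, caption in summary:
    print(f"{start:>3}-{end:<3} {caption}")
```

In a real system the word-count vectors would be replaced by embeddings from a pre-trained model, which is what lets this kind of approach adapt to a new preference without retraining; the selection logic around the similarity score could stay much the same.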
Findings
VSL outperforms state-of-the-art unsupervised video summarization models
VSL adapts across datasets better than supervised models
Its runtime is efficient enough to scale
Abstract
Video summarization techniques have been shown to improve the user experience of accessing and comprehending video content. When the user's preference is known, video summarization can identify the significant or relevant content in an input video, helping the user obtain the information they need or decide whether to watch the original video. Adapting video summarization to various types of video and user preferences requires significant training data and expensive human labeling. To facilitate such research, we propose a new benchmark for video summarization that captures various user preferences. We also present a pipeline called Video Summarization with Language (VSL) for user-preferred video summarization that is based on pre-trained visual language models (VLMs) to avoid the need to train a video summarization system on a large…
