GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization
Jia-Hong Huang, Luka Murn, Marta Mrak, Marcel Worring

TL;DR
This paper introduces GPT2MVS, a multi-modal video summarization model that uses a specialized attention network and contextualized word representations to generate summaries driven by a user's text-based query, with reported gains of +5.88% in accuracy and +4.06% in F1-score over the state of the art.
Contribution
The paper presents a novel multi-modal video summarization approach leveraging a specialized attention network and contextualized word representations for improved accuracy.
Findings
+5.88% accuracy over state-of-the-art
+4.06% F1-score improvement
Effective in user-interest-driven video summarization
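For readers unfamiliar with the F1-score quoted above, the following minimal sketch shows the standard definition applied to frame selection. It assumes a summary is represented as a set of selected frame indices; the function name `frame_f1` and the toy data are hypothetical, and the paper's exact evaluation protocol is not described on this page.

```python
def frame_f1(predicted, reference):
    """Standard F1 between predicted and reference sets of frame indices.

    Assumption: a summary is a set of selected frame indices. This is the
    textbook F1 definition, not necessarily the paper's exact protocol.
    """
    predicted, reference = set(predicted), set(reference)
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0  # no shared frames (also covers empty inputs)
    precision = overlap / len(predicted)  # share of predicted frames that are correct
    recall = overlap / len(reference)     # share of reference frames that were found
    return 2 * precision * recall / (precision + recall)


# Toy example: 2 of the 3 predicted frames appear in the 4-frame reference.
score = frame_f1([0, 2, 5], [0, 2, 3, 4])
```

Here precision is 2/3 and recall is 1/2, so the score is their harmonic mean, 4/7.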
Abstract
Traditional video summarization methods generate fixed video representations regardless of user interest. Such methods therefore fall short of users' expectations in content search and exploration scenarios. Multi-modal video summarization is one of the methods used to address this problem. When multi-modal video summarization is used to help video exploration, a text-based query is considered one of the main drivers of video summary generation, as it is user-defined. Thus, effectively encoding both the text-based query and the video is important for the task of multi-modal video summarization. In this work, a new method is proposed that uses a specialized attention network and contextualized word representations to tackle this task. The proposed model consists of a contextualized video summary controller, multi-modal attention mechanisms, an interactive attention network, and a video…
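To make the idea of a query-driven summary concrete, here is a minimal sketch of scoring frames against a text query with scaled dot-product attention and keeping the top-scoring ones. This is an illustration only, not the GPT2MVS architecture: the function names, the two-dimensional toy embeddings, and the top-k selection rule are all assumptions, and in practice the query and frame vectors would come from learned text and video encoders.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def query_attention_weights(query_vec, frame_vecs):
    """Attention weights of one query embedding over all frame embeddings.

    Hypothetical helper: uses plain scaled dot-product attention, which is
    a generic stand-in and not the paper's specialized attention network.
    """
    scale = math.sqrt(len(query_vec))
    logits = [
        sum(q * f for q, f in zip(query_vec, frame)) / scale
        for frame in frame_vecs
    ]
    return softmax(logits)


def select_summary_frames(query_vec, frame_vecs, k):
    """Return the indices of the k most query-relevant frames, in temporal order."""
    weights = query_attention_weights(query_vec, frame_vecs)
    ranked = sorted(range(len(frame_vecs)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])


# Toy embeddings (assumed): frames 0 and 2 point in the query's direction.
query = [1.0, 0.0]
frames = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0], [0.2, 0.8]]
weights = query_attention_weights(query, frames)
summary = select_summary_frames(query, frames, k=2)
```

With these toy vectors, frames 0 and 2 receive the largest weights, so they form the two-frame summary. Changing the query vector changes which frames are selected, which is the sense in which the summary is user-driven rather than fixed.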
Taxonomy
Topics: Video Analysis and Summarization · Music and Audio Processing · Advanced Image and Video Retrieval Techniques
