GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization

Jia-Hong Huang; Luka Murn; Marta Mrak; Marcel Worring

arXiv:2104.12465 · cs.CV · April 27, 2021 · 1 citation

TL;DR

This paper introduces GPT2MVS, a multi-modal video summarization model that uses a specialized attention network and contextualized word representations to generate user-interest-driven summaries, and reports gains over the previous state-of-the-art method.

Contribution

The paper presents a novel multi-modal video summarization approach leveraging a specialized attention network and contextualized word representations for improved accuracy.

Findings

01. +5.88% accuracy over the state-of-the-art method
02. +4.06% F1-score improvement
03. Effective in user-interest-driven video summarization

Abstract

Traditional video summarization methods generate a fixed video representation regardless of user interest. Such methods therefore fall short of users' expectations in content search and exploration scenarios. Multi-modal video summarization is one approach to this problem. When it is used to support video exploration, a text-based query is one of the main drivers of summary generation, because it is defined by the user. Effective encoding of both the text-based query and the video is therefore important for the task. In this work, a new method is proposed that uses a specialized attention network and contextualized word representations to tackle this task. The proposed model consists of a contextualized video summary controller, multi-modal attention mechanisms, an interactive attention network, and a video…
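The idea of a query-driven summary can be illustrated with a minimal sketch. It is not the GPT2MVS architecture: it assumes per-frame feature vectors and a query embedding of the same dimension are already available, and it swaps in plain scaled dot-product attention for the paper's attention networks. The function names, the toy vectors, and the top-k selection rule are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_attention_summary(frame_feats, query_vec, k):
    """Score each frame against the query and keep the top-k frames.

    frame_feats: list of per-frame feature vectors (hypothetical inputs)
    query_vec:   embedding of the text query, same dimension (hypothetical)
    k:           number of frames to keep in the summary

    Returns the selected frame indices in temporal order and the
    attention weight assigned to every frame.
    """
    scale = math.sqrt(len(query_vec))
    # Scaled dot product between the query and every frame.
    scores = [
        sum(f * q for f, q in zip(frame, query_vec)) / scale
        for frame in frame_feats
    ]
    weights = softmax(scores)
    # Rank frames by attention weight, then restore temporal order.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k]), weights


# Toy example: four 2-D "frames" and a query aligned with the first axis.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
query = [1.0, 0.0]
selected, weights = query_attention_summary(frames, query, k=2)
print(selected)
```

In this toy setup a different query vector selects different frames from the same video, which is the behaviour that a fixed, query-independent summary cannot provide.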

Taxonomy

Topics: Video Analysis and Summarization · Music and Audio Processing · Advanced Image and Video Retrieval Techniques