SD-VSum: A Method and Dataset for Script-Driven Video Summarization
Manolis Mylonas, Evlampios Apostolidis, Vasileios Mezaris

TL;DR
This paper introduces the task of script-driven video summarization, extends the VideoXum dataset with natural language descriptions of its human-annotated summaries so that it can be used for training, and proposes a cross-modal attention network (SD-VSum) that outperforms existing methods at generating summaries tailored to a user-provided script.
Contribution
The paper defines the script-driven video summarization task, extends VideoXum with a natural language description for each human-annotated summary, and develops SD-VSum, a cross-modal attention network for the task.
Findings
SD-VSum outperforms state-of-the-art methods in experiments.
The dataset extension enables training of script-driven summarization models.
The approach produces summaries tailored to user scripts and needs.
Abstract
In this work, we introduce the task of script-driven video summarization, which aims to produce a summary of the full-length video by selecting the parts that are most relevant to a user-provided script outlining the visual content of the desired summary. Next, we extend a recently introduced large-scale dataset for generic video summarization (VideoXum) by producing natural language descriptions of the different human-annotated summaries that are available per video. In this way we make it compatible with the introduced task, since the available triplets of "video, summary and summary description" can be used for training a method that is able to produce different summaries for a given video, driven by the provided script about the content of each summary. Finally, we develop a new network architecture for script-driven video summarization (SD-VSum) that employs a cross-modal…
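The abstract excerpt is cut off before it describes the architecture, so the following is only a minimal sketch of the general idea of script-conditioned frame selection, not the authors' SD-VSum network. It assumes that frames and script sentences have already been embedded into a shared feature space, uses plain scaled dot-product attention (frames as queries, script sentences as keys and values), and scores each frame with a simple dot product as a stand-in for a learned scoring head. The names `cross_modal_attention`, `select_summary`, and `budget`, and the toy feature values, are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_modal_attention(frame_feats, sentence_feats):
    """Each frame (query) attends over the script sentences (keys and values).

    Returns one script-context vector per frame. This is generic scaled
    dot-product attention, assumed here for illustration only.
    """
    d = len(sentence_feats[0])
    attended = []
    for q in frame_feats:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in sentence_feats])
        context = [
            sum(w * v[i] for w, v in zip(weights, sentence_feats))
            for i in range(d)
        ]
        attended.append(context)
    return attended


def select_summary(frame_feats, sentence_feats, budget):
    """Score frames by agreement with their script context; keep the top `budget`.

    The dot-product score is a hypothetical stand-in for a trained scorer.
    """
    attended = cross_modal_attention(frame_feats, sentence_feats)
    scores = [dot(f, a) for f, a in zip(frame_feats, attended)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget]), scores


# Toy features: frames 0 and 2 align with the script, frames 1 and 3 do not.
frame_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.1]]
sentence_feats = [[1.0, 0.0], [0.9, 0.1]]
selected, scores = select_summary(frame_feats, sentence_feats, budget=2)
print(selected)
```

Because the selection depends on `sentence_feats`, passing a different script for the same video yields a different set of selected frames, which is the behavior the task definition asks for.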
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection
Methods: Softmax · Attention Is All You Need
