SD-VSum: A Method and Dataset for Script-Driven Video Summarization

Manolis Mylonas; Evlampios Apostolidis; Vasileios Mezaris

arXiv:2505.03319 · cs.CV · September 23, 2025

TL;DR

This paper introduces the task of script-driven video summarization, extends the VideoXum dataset with natural language descriptions of its human-annotated summaries so that it can be used for training, and proposes SD-VSum, a cross-modal attention network that is reported to outperform existing methods at producing summaries driven by a user-provided script.

Contribution

The paper defines the task of script-driven video summarization, extends the VideoXum dataset with natural language descriptions of its human-annotated summaries, and develops SD-VSum, a cross-modal attention network for this task.

Findings

01. SD-VSum outperforms state-of-the-art methods in experiments.
02. The dataset extension enables training of script-driven summarization models.
03. The approach produces summaries tailored to user scripts and needs.

Abstract

In this work, we introduce the task of script-driven video summarization, which aims to produce a summary of the full-length video by selecting the parts that are most relevant to a user-provided script outlining the visual content of the desired summary. Next, we extend a recently introduced large-scale dataset for generic video summarization (VideoXum) by producing natural language descriptions of the different human-annotated summaries that are available per video. In this way, we make it compatible with the introduced task, since the available triplets of "video, summary and summary description" can be used for training a method that is able to produce different summaries for a given video, driven by the provided script about the content of each summary. Finally, we develop a new network architecture for script-driven video summarization (SD-VSum), that employs a cross-modal…
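The selection step described in the abstract (keep the parts of the video most relevant to the script) can be illustrated with a minimal sketch. This is not the SD-VSum architecture, whose details are truncated above. It assumes frames and script sentences are already embedded in a shared vector space, uses plain scaled dot-product attention with no learned projections, and picks the highest-scoring frames under a fixed budget. All function names (`cross_attention`, `select_frames`) and the toy vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(frame_feats, script_feats):
    """Each frame (query) attends over the script sentences (keys and values).

    Assumption: both modalities share one embedding space, so no learned
    projection matrices are applied here.
    """
    dim = len(script_feats[0])
    scale = math.sqrt(dim)
    attended, weights = [], []
    for q in frame_feats:
        w = softmax([dot(q, k) / scale for k in script_feats])
        # Script context for this frame: attention-weighted sum of sentences.
        ctx = [sum(wi * v[d] for wi, v in zip(w, script_feats)) for d in range(dim)]
        weights.append(w)
        attended.append(ctx)
    return attended, weights


def select_frames(frame_feats, script_feats, budget):
    """Score each frame by agreement with its script context, keep the top `budget`."""
    attended, _ = cross_attention(frame_feats, script_feats)
    scores = [dot(f, c) for f, c in zip(frame_feats, attended)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget]), scores


if __name__ == "__main__":
    # Toy embeddings (hypothetical): four frames, one script sentence.
    frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    script = [[1.0, 0.0]]
    selected, _ = select_frames(frames, script, budget=2)
    print(selected)  # [0, 2]
```

In this toy run the two frames whose embeddings point in the same direction as the script sentence are kept, in temporal order. A trained model would instead learn the projections and the scoring function from the "video, summary and summary description" triplets.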

Code & Models

Repositories

idt-iti/sd-vsum
PyTorch · Official

Taxonomy

Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection

Methods: Softmax · Attention Is All You Need