MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

Haojie Zhang, Di Wu, Bingyan Liu, Linjie Zhong, Yuancheng Wei, Xingsong Ye, Nanqing Liu, and Yaling Liang

arXiv:2604.23789 · cs.CV · May 12, 2026

TL;DR

MuSS is a large-scale dataset and benchmark designed to improve multi-shot subject-to-video generation by addressing narrative coherence, alignment conflicts, and copy-paste issues, enabling more cinematic storytelling.

Contribution

We introduce MuSS, a novel dataset and benchmark with a progressive captioning pipeline and cross-shot matching to enhance multi-shot video generation and narrative consistency.

Findings

01. MuSS enables better narrative coherence in multi-shot video generation.
02. Our model achieves state-of-the-art performance in narrative effectiveness.
03. Current baselines struggle with continuous storytelling and structural consistency.

Abstract

While video foundation models excel at single-shot generation, real-world cinematic storytelling inherently relies on complex multi-shot sequencing. Further progress is constrained by the absence of datasets that address three core challenges: authentic narrative logic, spatiotemporal text-video alignment conflicts, and the "copy-paste" dilemma prevalent in Subject-to-Video (S2V) generation. To bridge this gap, we introduce MuSS, a large-scale, dual-track dataset tailored for multi-shot video and S2V generation. Sourced from over 3,000 movies, MuSS explicitly supports both complex montage transitions and subject-centric narratives. To construct this dataset, we pioneer a progressive captioning pipeline that eliminates contextual conflicts by ensuring local shot-level accuracy before enforcing global narrative coherence. Crucially, we implement a cross-shot matching mechanism to…

Code & Models

Repositories

zhang-haojie/MuSS (GitHub)
