TL;DR
MuSS is a large-scale dataset and benchmark for multi-shot video and Subject-to-Video (S2V) generation. It targets three problems: missing narrative logic, spatiotemporal text-video alignment conflicts, and the "copy-paste" problem in S2V, with the aim of more cinematic storytelling.
Contribution
We introduce MuSS, a dataset and benchmark built with a progressive captioning pipeline and a cross-shot matching mechanism, to improve multi-shot video generation and narrative consistency.
Findings
MuSS enables better narrative coherence in multi-shot video generation.
Our model achieves state-of-the-art performance in narrative effectiveness.
Current baselines struggle with continuous storytelling and structural consistency.
Abstract
While video foundation models excel at single-shot generation, real-world cinematic storytelling inherently relies on complex multi-shot sequencing. Further progress is constrained by the absence of datasets that address three core challenges: authentic narrative logic, spatiotemporal text-video alignment conflicts, and the "copy-paste" dilemma prevalent in Subject-to-Video (S2V) generation. To bridge this gap, we introduce MuSS, a large-scale, dual-track dataset tailored for multi-shot video and S2V generation. Sourced from over 3,000 movies, MuSS explicitly supports both complex montage transitions and subject-centric narratives. To construct this dataset, we pioneer a progressive captioning pipeline that eliminates contextual conflicts by ensuring local shot-level accuracy before enforcing global narrative coherence. Crucially, we implement a cross-shot matching mechanism to…
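The abstract describes the captioning pipeline only at a high level: shot-level accuracy first, global narrative coherence second. The sketch below shows one way that ordering could be arranged; it is not the authors' implementation. The `Shot` type, `caption_progressively`, and the two stub callables `describe_shot` and `link_shots` are all hypothetical names introduced for illustration, and the stubs stand in for whatever captioning models are actually used.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Shot:
    """Hypothetical container for one shot; `content` stands in for real frames."""
    shot_id: int
    content: str
    local_caption: str = ""
    global_caption: str = ""


def caption_progressively(
    shots: List[Shot],
    describe_shot: Callable[[Shot], str],
    link_shots: Callable[[int, List[str]], str],
) -> List[Shot]:
    """Two-stage captioning: local per-shot captions first, global pass second."""
    # Stage 1 (local): caption each shot in isolation, so no other shot's
    # context can leak into its description.
    for shot in shots:
        shot.local_caption = describe_shot(shot)

    # Stage 2 (global): build the narrative caption from the frozen local
    # captions, so the global pass adds context on top of them.
    local_captions = [shot.local_caption for shot in shots]
    for index, shot in enumerate(shots):
        shot.global_caption = link_shots(index, local_captions)
    return shots


# Stub captioners (assumptions): a real pipeline would call captioning models here.
def describe_shot(shot: Shot) -> str:
    return f"Shot {shot.shot_id}: {shot.content}"


def link_shots(index: int, local_captions: List[str]) -> str:
    if index == 0:
        return "Opening. " + local_captions[index]
    return f"After [{local_captions[index - 1]}]. " + local_captions[index]


if __name__ == "__main__":
    demo = [Shot(0, "a woman enters a kitchen"), Shot(1, "close-up of her hands")]
    for shot in caption_progressively(demo, describe_shot, link_shots):
        print(shot.global_caption)
```

In this arrangement the global stage never sees raw shot content, only the local captions, which is one simple way to keep the narrative pass from contradicting shot-level descriptions.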
