ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation

Songlin Yang; Zhe Wang; Xuyi Yang; Songchun Zhang; Xianghao Kong; Taiyi Wu; Xiaotong Zhao; Ran Zhang; Alan Zhao; Anyi Rao

arXiv:2603.11421·cs.CV·March 13, 2026

Open Access

TL;DR

ShotVerse introduces a framework that combines vision-language models and automated calibration to enable precise, cinematic multi-shot video creation from text prompts, avoiding both the imprecision of implicit textual prompts and the manual overhead of explicit trajectory conditioning.

Contribution

The paper presents a data-centric approach with a new planning-control framework and a curated cinematic dataset for text-driven multi-shot video generation.

Findings

01. Achieves camera-accurate, consistent multi-shot videos from text
02. Bridges the gap between textual prompts and manual plotting
03. Outperforms existing methods in cinematic quality

Abstract

Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant bottleneck. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a VLM (Vision-Language Model)-based Planner that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a…
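The "Plan-then-Control" decoupling described in the abstract can be pictured as two interfaces chained together. The following is only a minimal sketch under stated assumptions, not the authors' implementation: `Shot`, `Planner`, `Controller`, `ToyPlanner`, `ToyController`, and `plan_then_control` are hypothetical names, a trajectory is reduced to a list of camera positions, and the toy classes stand in for the VLM-based Planner and the video-rendering Controller.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence, Tuple


@dataclass
class Shot:
    """One shot: its caption plus a planned camera path (hypothetical layout)."""
    caption: str
    trajectory: List[Tuple[float, float, float]]  # positions only, a simplification


class Planner(Protocol):
    """Maps per-shot captions to trajectories expressed in one shared frame."""
    def plan(self, captions: Sequence[str]) -> List[Shot]: ...


class Controller(Protocol):
    """Turns planned shots into video content (here: placeholder frame labels)."""
    def render(self, shots: Sequence[Shot]) -> List[str]: ...


class ToyPlanner:
    """Stand-in for a VLM-based planner: emits a straight camera move per shot."""

    def __init__(self, steps: int = 4) -> None:
        self.steps = steps

    def plan(self, captions: Sequence[str]) -> List[Shot]:
        shots = []
        for index, caption in enumerate(captions):
            # Offset each shot along x so all shots live in one global frame.
            path = [(float(index), 0.0, 0.5 * t) for t in range(self.steps)]
            shots.append(Shot(caption, path))
        return shots


class ToyController:
    """Stand-in for a renderer: returns one label per camera pose."""

    def render(self, shots: Sequence[Shot]) -> List[str]:
        frames = []
        for shot_index, shot in enumerate(shots):
            for pose in shot.trajectory:
                frames.append(f"shot{shot_index}@{pose}")
        return frames


def plan_then_control(
    captions: Sequence[str], planner: Planner, controller: Controller
) -> List[str]:
    """Plan trajectories from text first, then hand them to the controller."""
    return controller.render(planner.plan(captions))


frames = plan_then_control(
    ["wide shot of a harbor", "close-up of a sailor"],
    ToyPlanner(steps=3),
    ToyController(),
)
```

The point of the split in this sketch is that the planner never touches pixels and the controller never parses text, so either side could be swapped out independently.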

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Multimodal Machine Learning Applications