CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models
Xiaoxue Wu, Bingjie Gao, Yu Qiao, Yaohui Wang, Xinyuan Chen

TL;DR
CineTrans is a novel framework that generates cinematic multi-shot videos with controlled transitions using masked diffusion models, trained on a new annotated dataset, and outperforms existing methods in quality and stability.
Contribution
The paper introduces CineTrans, a new method for multi-shot video generation with cinematic transitions, leveraging a mask-based control mechanism and a new annotated dataset.
Findings
CineTrans produces coherent multi-shot videos with cinematic transitions.
The mask-based control mechanism enables arbitrary transition placement.
CineTrans outperforms existing baselines in quality, transition control, and temporal consistency.
Abstract
Despite significant advances in video synthesis, research into multi-shot video generation remains in its infancy. Even with scaled-up models and massive datasets, the shot transition capabilities remain rudimentary and unstable, largely confining generated videos to single-shot sequences. In this work, we introduce CineTrans, a novel framework for generating coherent multi-shot videos with cinematic, film-style transitions. To facilitate insights into the film editing style, we construct a multi-shot video-text dataset Cine250K with detailed shot annotations. Furthermore, our analysis of existing video diffusion models uncovers a correspondence between attention maps in the diffusion model and shot boundaries, which we leverage to design a mask-based control mechanism that enables transitions at arbitrary positions and transfers effectively in a training-free setting. After fine-tuning…
Peer Reviews
Decision: ICLR 2026 Poster
- Clear, well-motivated mechanism: The block-diagonal mask is explicitly defined and integrated into the attention logits, with principled alignment to measured intra- vs inter-shot attention structure.
- Strong empirical gains in control: Transition Control Score improves markedly over large T2V and multi-shot baselines while preserving quality.
- Thoughtful evaluation design: The paper evaluates transition control, intra-/inter-shot consistency, and aesthetic quality, including a novel Consi
- Over-hard masking; missed opportunity for temporal scheduling: Equation (2) uses a binary mask with 0 on same-shot pairs and $-\infty$ across shots, which hard-zeros inter-shot attention in Equation (3). While effective, this may induce abrupt changes (also visible in the shared videos). A time-dependent or diffusion-step-dependent penalty could yield smoother transitions, e.g., replacing $-\infty$ by $-\alpha(t)$ that reaches $-\infty$ for a couple of time steps, this ramps near shot boundari
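As a rough illustration of the masking this review describes (0 for same-shot pairs, $-\infty$ across shots, added to the attention logits before the softmax), here is a minimal NumPy sketch. It is not the authors' implementation. The function names, the frame-level granularity (a real video diffusion model attends over many tokens per frame), and the optional finite `penalty` standing in for the reviewer's suggested $-\alpha(t)$ are all assumptions made for illustration.

```python
import numpy as np

def shot_ids_from_boundaries(num_frames, boundaries):
    """Label each frame with a shot index.

    `boundaries` lists the first frame of every shot after the first one
    (hypothetical convention chosen for this sketch).
    """
    ids = np.zeros(num_frames, dtype=int)
    for b in boundaries:
        ids[b:] += 1
    return ids

def shot_mask(shot_ids, penalty=np.inf):
    """Additive mask: 0 for same-shot pairs, -penalty for cross-shot pairs.

    penalty=np.inf gives the hard block-diagonal mask; a finite value is a
    softer variant in the spirit of the reviewer's -alpha(t) suggestion.
    """
    same_shot = shot_ids[:, None] == shot_ids[None, :]
    return np.where(same_shot, 0.0, -penalty)

def masked_attention_weights(logits, mask):
    """Row-wise softmax of (logits + mask), computed stably."""
    z = logits + mask
    # Every row keeps its own diagonal entry unmasked, so the max is finite.
    z = z - z.max(axis=-1, keepdims=True)
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

# Six frames, with new shots starting at frames 2 and 4.
ids = shot_ids_from_boundaries(6, [2, 4])
hard = masked_attention_weights(np.zeros((6, 6)), shot_mask(ids))
soft = masked_attention_weights(np.zeros((6, 6)), shot_mask(ids, penalty=2.0))
```

With the hard mask, each frame's attention weight on frames from other shots is exactly zero, so the weight matrix is block-diagonal. With a finite penalty, cross-shot weights are reduced but stay positive, which is the kind of relaxation the reviewer proposes scheduling over time.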
1. This paper contributes a large multi-shot dataset of 250K videos with frame-level shot boundaries and hierarchical captions.
2. The proposed method is simple and easy to follow.
3. The paper is well-structured and easy to read.
1. The method in this paper only adds a mask mechanism between shots to ensure content consistency, but this makes it difficult to maintain fine-grained consistency across different shots, especially for background regions, as shown on the left side of Figure 5. Table 1 (Intra-shot Consistency) also demonstrates that the improvement in consistency achieved by this method is limited compared to the baseline.
2. Although the paper claims to achieve cinematic transitions, the proposed method only s
The work introduces a novel mechanism that discovers and leverages a block-diagonal pattern in attention maps for transition control, directly inspiring an improved architecture; it further uses a mask-based strategy to align model internals with multi-shot video structure, enabling precise, shot-wise editing. A new large-scale dataset, Cine250K, fills a key gap with rich annotations (including frame-level labels and semantic stitching) and follows film-editing conventions. Extensive experiments
1. Lack of Analysis on Limiting Scenarios: The paper demonstrates strong results on curated prompts and the Cine250K distribution, but does not critically examine or quantify limitations outside this scope, e.g., severe domain shifts, failure cases, or fundamental breakdowns of mask-based control when transition points are ambiguous or overlap.
2. Limited Theoretical Rigor or Insights: While the empirical demonstration of attention map patterns is clear (see Figure 4), the work lacks theoretical
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
