Captain Cinema: Towards Short Movie Generation
Junfei Xiao, Ceyuan Yang, Lvmin Zhang, Shengqu Cai, Yang Zhao, Yuwei Guo, Gordon Wetzstein, Maneesh Agrawala, Alan Yuille, and Lu Jiang

TL;DR
Captain Cinema introduces a novel framework for generating short movies from detailed text descriptions by planning keyframes and synthesizing videos with long-range coherence, using a specialized training strategy for cinematic data.
Contribution
The paper presents a new end-to-end framework combining top-down keyframe planning and bottom-up video synthesis, with an interleaved training method for long-context cinematic video generation.
Findings
Produces visually coherent short movies from text descriptions.
Supports long narrative coherence with high-quality visuals.
Efficient generation suitable for cinematic applications.
Abstract
We present Captain Cinema, a framework for short movie generation. Given a detailed textual description of a movie storyline, our approach first generates a sequence of keyframes that outline the entire narrative, which ensures long-range coherence in both the storyline and visual appearance (e.g., scenes and characters). We refer to this step as top-down keyframe planning. These keyframes then serve as conditioning signals for a video synthesis model that supports long-context learning and produces the spatio-temporal dynamics between them. We refer to this step as bottom-up video synthesis. To support stable and efficient generation of multi-scene, long-narrative cinematic works, we introduce an interleaved training strategy for Multimodal Diffusion Transformers (MM-DiT), specifically adapted for long-context video data. Our model is trained on a specially curated…
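The two stages named in the abstract can be pictured as a simple data flow. The sketch below is only an illustration of that flow, not the authors' implementation: `Keyframe`, `Clip`, `plan_keyframes`, `synthesize_video`, and `generate_short_movie` are hypothetical names, both stages are placeholders with no model behind them, and the choice of one clip per adjacent keyframe pair is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Keyframe:
    index: int
    caption: str  # per-scene text taken from the storyline (hypothetical field)


@dataclass
class Clip:
    start: Keyframe  # keyframe the clip is conditioned to begin at
    end: Keyframe    # keyframe the clip is conditioned to end at


def plan_keyframes(scene_captions: List[str]) -> List[Keyframe]:
    """Top-down step (placeholder): one keyframe per scene caption."""
    return [Keyframe(i, caption) for i, caption in enumerate(scene_captions)]


def synthesize_video(keyframes: List[Keyframe]) -> List[Clip]:
    """Bottom-up step (placeholder): one clip per adjacent keyframe pair."""
    return [Clip(a, b) for a, b in zip(keyframes, keyframes[1:])]


def generate_short_movie(scene_captions: List[str]) -> List[Clip]:
    # Plan the whole narrative first, then fill in the motion between keyframes.
    keyframes = plan_keyframes(scene_captions)
    return synthesize_video(keyframes)
```

In this sketch, consecutive clips share a keyframe, which is one way to read the claim that planning the keyframes first keeps scenes and characters consistent across the whole story.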
Peer Reviews
Decision: ICLR 2026 Poster
1. The overall framework is simple and makes sense. It mainly consists of a two-stage architecture: the decomposition of the problem into top-down planning (storyboarding) and bottom-up synthesis (animating) is logical and effective. This enables multi-scene generation with better narrative coherence.
2. GoldenMem Context Compression: The GoldenMem technique introduces an inverse-Fibonacci downsampling protocol, maintaining rich long-range context while keeping token budgets tractable. Ablative
1. **Limited Theoretical Analysis of GoldenMem**: The mathematical treatment of GoldenMem in Section 3.2, while operationally clear, lacks a principled analysis of compression trade-offs. The geometric decay is motivated by token counts, but there is little insight into semantic loss or the optimality of the selection policy. For example, while the summation gives a neat token bound, the information preserved under this decay is not theoretically or empirically analyzed—could less aggressive or
1. This work makes a preliminary exploration of multi-scene, whole-movie generation. Its data acquisition and processing, as well as the proposed keyframe planning and keyframe-conditioned video generation, will contribute to the community and inspire future work.
2. The introduced GoldenMem mechanism is technically sound and effective, significantly reducing the token budget under long contexts, as shown in Figure 4. Both GoldenMem and Hybrid Attention Masking alleviate computational explosion
1. There appears to be no qualitative comparison between Captain Cinema and existing methods. Based on the provided videos, the diversity and richness of the generated scenes are still limited, with no significant difference compared to LCT.
2. The GoldenMem mechanism is somewhat similar to FramePack, which diminishes its novelty.
3. The keyframe-based approach cannot guarantee temporal consistency between shots, and the proposed keyframe-conditioned video generation model does not seem to be
1. Clear decomposition into planning + synthesis: separating long-range narrative/keyframe planning from local spatio-temporal generation addresses coherence at different timescales and makes the overall task more tractable.
2. Long-context modeling focus: adapting MM-DiT with an interleaved training strategy targets one of the core failure modes of generative video (losing global narrative or character consistency across scenes).
3. Use of keyframes as conditioning signals where explicit ke
1. The method’s success hinges on the generated keyframes accurately reflecting the narrative and desired visual style; failure modes or hallucinated keyframes could produce hard-to-correct artifacts in the final video. Analysis on that could be useful.
2. Conditioning on keyframes constrains endpoints, but plausible and temporally coherent transitions between them remain challenging; motion realism, continuity of lighting, and consistent character actions may still suffer. Any comments on tha
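The geometric decay and token bound that the reviews attribute to GoldenMem can be illustrated with a small calculation. This is a rough sketch under stated assumptions, not the paper's protocol: it assumes the k-th most recent context frame keeps a fraction `ratio**k` of the full token count, and it picks `ratio = 1 / PHI**2` (side lengths shrinking by the golden ratio per step) purely for illustration. The function names are hypothetical, and the actual inverse-Fibonacci schedule is not given in the excerpts here.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def context_token_budget(tokens_full: int, num_frames: int,
                         ratio: float = 1 / PHI**2) -> float:
    """Total tokens when the k-th most recent frame keeps tokens_full * ratio**k.

    Assumption: a plain geometric decay; k = 0 is the newest frame at full size.
    """
    return sum(tokens_full * ratio**k for k in range(num_frames))


def budget_upper_bound(tokens_full: int, ratio: float = 1 / PHI**2) -> float:
    """Limit of the geometric series: the budget never exceeds this value."""
    return tokens_full / (1 - ratio)


def uniform_token_budget(tokens_full: int, num_frames: int) -> int:
    """Baseline without compression: every context frame stays at full size."""
    return tokens_full * num_frames
```

With this particular ratio the bound works out to `PHI * tokens_full`, because `1 - 1/PHI**2 = 1/PHI`. The compressed context therefore stays under roughly 1.62 times the cost of a single full-resolution frame regardless of history length, while the uncompressed baseline grows linearly. What such a calculation cannot show is the point raised in the reviews: how much semantic information survives the decay.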
Taxonomy
Topics: Cinema and Media Studies
