Bernini: Latent Semantic Planning for Video Diffusion
Bernini Team: Chenchen Liu, Junyi Chen, Lei Li, Lu Chi, Mingzhen Sun, Zhuoying Li, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan

TL;DR
Bernini pairs a multimodal language model, which handles semantic planning, with a diffusion model, which handles pixel rendering, so that video generation and editing share one framework aimed at efficient, high-quality synthesis.
Contribution
The paper introduces Bernini, a framework that separates semantic planning from pixel rendering. The two stages can be trained independently, and the authors report improved video generation and editing performance.
Findings
Achieves state-of-the-art results on video generation benchmarks.
Demonstrates strong generalization in editing tasks, attributed to the planner's pretrained semantic understanding.
Introduces Segment-Aware 3D Rotary Positional Embedding for better multi-input handling.
Abstract
Multimodal large language models (MLLMs) and diffusion models have each reached remarkable maturity: MLLMs excel at reasoning over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models synthesize images and videos with photorealistic fidelity. We argue that these two families can be unified through a simple division of labor: MLLMs perform semantic planning, while diffusion models render pixels from high-level semantic guidance and low-level visual features. Building on this idea, we propose Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer synthesizes pixels conditioned on this plan, augmented by text features and, for editing, source VAE features for detail preservation. Because semantics serve as the interface,…
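The division of labor described in the abstract can be illustrated with a toy data-flow sketch. This is not the authors' implementation: `plan_semantics`, `build_conditions`, `conditioning_sequence`, and `RenderConditions` are hypothetical names, the "planner" is a simple averaging placeholder, and concatenating the conditions along the token axis is an assumption, since the abstract does not say how the renderer consumes them. The sketch only shows which signals reach the renderer in generation versus editing.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]


@dataclass
class RenderConditions:
    semantic_plan: List[Vector]                  # planner output, in the ViT embedding space
    text_features: List[Vector]                  # text conditioning
    source_vae_features: Optional[List[Vector]]  # editing only; None for generation


def plan_semantics(text_features, source_vit_tokens, num_target_tokens):
    """Toy stand-in for the MLLM planner (hypothetical).

    Averages all context tokens and repeats the result. A real planner
    would predict the target tokens; this only fixes the output shape.
    """
    context = text_features + (source_vit_tokens or [])
    dim = len(context[0])
    mean = [sum(tok[i] for tok in context) / len(context) for i in range(dim)]
    return [list(mean) for _ in range(num_target_tokens)]


def build_conditions(text_features, num_target_tokens,
                     source_vit_tokens=None, source_vae_features=None):
    """Run the planner, then bundle everything the renderer is conditioned on."""
    plan = plan_semantics(text_features, source_vit_tokens, num_target_tokens)
    return RenderConditions(plan, text_features, source_vae_features)


def conditioning_sequence(cond):
    """Assumed conditioning layout: plan tokens, text tokens, optional VAE tokens."""
    seq = cond.semantic_plan + cond.text_features
    if cond.source_vae_features is not None:
        seq = seq + cond.source_vae_features
    return seq


text = [[1.0, 0.0], [0.0, 1.0]]

# Generation: semantic plan + text features only.
generation = build_conditions(text, num_target_tokens=3)

# Editing: the source video also contributes ViT tokens to the planner
# and VAE features to the renderer for detail preservation.
editing = build_conditions(
    text, 3,
    source_vit_tokens=[[1.0, 1.0]],
    source_vae_features=[[0.5, 0.5]] * 4,
)

print(len(conditioning_sequence(generation)), len(conditioning_sequence(editing)))
```

In this sketch the only thing that crosses from planner to renderer is a list of semantic tokens, which is the property the abstract relies on when it describes semantics as the interface between the two models.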
