CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation

Kaiyi Huang; Yukun Huang; Yu Li; Jianhong Bai; Xintao Wang; Zinan Lin; Xuefei Ning; Jiwen Yu; Pengfei Wan; Yu Wang; Xihui Liu

arXiv:2602.06959 · cs.CV · February 9, 2026


TL;DR

CineScene introduces an implicit 3D-aware scene representation framework for cinematic video generation, enabling scene consistency and camera control in synthesized videos from static environment images.

Contribution

The paper proposes a novel implicit 3D scene representation and context conditioning mechanism for cinematic video synthesis, along with a new dataset for training and evaluation.

Findings

01. Achieves state-of-the-art scene consistency in generated videos
02. Handles large camera movements effectively
03. Generalizes across diverse environments

Abstract

Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly because it requires constructing physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring a dynamic subject while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features implicitly: by encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model by additional…
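To make the "user-specified camera trajectory" input concrete, the following is a minimal sketch of one common way such a trajectory can be written down: a list of per-frame 4x4 camera-to-world matrices along an orbit around the scene origin. The abstract does not state how CineScene parameterizes cameras, so the function names (`look_at_pose`, `orbit_trajectory`), the OpenGL-style axis convention, and all default values here are assumptions for illustration only.

```python
import math


def look_at_pose(eye, target=(0.0, 0.0, 0.0), up=(0.0, 1.0, 0.0)):
    """Build a 4x4 camera-to-world matrix (nested lists) for a camera at `eye`
    looking at `target`. Hypothetical helper, not part of CineScene.
    Assumes the viewing direction is not parallel to `up`."""

    def sub(a, b):
        return [a[i] - b[i] for i in range(3)]

    def cross(a, b):
        return [
            a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0],
        ]

    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    forward = normalize(sub(target, eye))
    right = normalize(cross(forward, up))
    true_up = cross(right, forward)

    # Assumed OpenGL-style convention: the camera looks down its -Z axis,
    # so the rotation columns are (right, up, -forward); the last column
    # is the camera position in world coordinates.
    return [
        [right[0], true_up[0], -forward[0], eye[0]],
        [right[1], true_up[1], -forward[1], eye[1]],
        [right[2], true_up[2], -forward[2], eye[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]


def orbit_trajectory(num_frames=16, radius=2.0, height=0.5, sweep_degrees=90.0):
    """Return one camera-to-world pose per frame along a horizontal arc
    around the origin. All defaults are illustrative values."""
    poses = []
    for i in range(num_frames):
        t = i / (num_frames - 1) if num_frames > 1 else 0.0
        angle = math.radians(sweep_degrees) * t
        eye = (radius * math.sin(angle), height, radius * math.cos(angle))
        poses.append(look_at_pose(eye))
    return poses


if __name__ == "__main__":
    trajectory = orbit_trajectory(num_frames=8)
    for frame_index, pose in enumerate(trajectory):
        position = [round(row[3], 3) for row in pose[:3]]
        print(frame_index, position)
```

A larger `sweep_degrees` gives a wider arc, which corresponds to the "large camera movements" case listed under Findings; how the model actually consumes such poses is not described in the truncated abstract.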


Code & Models

Datasets

KlingTeam/Scene-Decoupled-Video-dataset
dataset · 176 downloads


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · 3D Shape Modeling and Analysis