Lighting-grounded Video Generation with Renderer-based Agent Reasoning
Ziqi Cai, Taoyu Yang, Zheng Chang, Si Li, Han Jiang, Shuchen Weng, Boxin Shi

TL;DR
LiVER is a diffusion-based video generation framework that gives explicit control over scene factors such as layout, lighting, and camera parameters, supported by a new dataset and a scene agent that works from user instructions.
Contribution
The paper introduces LiVER, a scene-controllable video generation framework that conditions on explicit 3D scene properties. It also contributes a new large-scale dataset annotated with object layout, lighting, and camera parameters, and a scene agent for improved control.
Findings
LiVER achieves state-of-the-art photorealism and temporal consistency.
It enables precise, disentangled control over scene factors.
The framework supports fully editable 3D scene-based video synthesis.
Abstract
Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting their applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation. To achieve this, we introduce a novel framework that conditions video synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a…
