TL;DR
This paper introduces a novel 3D Gaussian scene representation for fast, camera-controlled video generation from a single image, achieving state-of-the-art quality and efficiency.
Contribution
The paper proposes a framework that constructs a 3D Gaussian scene and samples object motion in one pass, enabling fast, controllable video synthesis without iterative denoising.
Findings
Achieves state-of-the-art video quality on multiple datasets.
Enables fast inference without iterative denoising.
Provides precise camera control and coherent object motion.
Abstract
Humans excel at forecasting the future dynamics of a scene given just a single image. Video generation models that can mimic this ability are an essential component for intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation. However, these methods often lack robust user controllability, such as modifying the camera path, which limits their use in real-world settings. Most existing camera-controlled image-to-video models struggle with accurately modeling camera motion, maintaining temporal consistency, and preserving geometric integrity. Leveraging explicit intermediate 3D representations offers a promising solution by enabling coherent video generation aligned with a given camera trajectory. Although these methods often use 3D point clouds to render scenes and introduce object motion in a later…
