TL;DR
Diff4Splat is a fast, controllable method for synthesizing 4D scenes from a single image. It combines video diffusion priors with geometry and motion constraints to generate high-quality dynamic scenes.
Contribution
It introduces a feed-forward approach that unifies video diffusion priors with learned 4D geometry and motion constraints, synthesizing a scene in a single forward pass without test-time optimization or post-hoc refinement.
Findings
Synthesizes high-quality 4D scenes in 30 seconds
Matches or surpasses optimization-based methods in dynamic scene synthesis
Effective in video generation, view synthesis, and geometry extraction
Abstract
We introduce Diff4Splat, a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image. Our approach unifies the generative priors of video diffusion models with geometry and motion constraints learned from large-scale 4D datasets. Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion, all in a single forward pass, without test-time optimization or post-hoc refinement. At the core of our framework lies a video latent transformer, which augments video diffusion models to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives. Training is guided by objectives on appearance fidelity, geometric accuracy, and motion consistency, enabling Diff4Splat to synthesize high-quality 4D scenes in 30…
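To make the phrase "deformable 3D Gaussian field that encodes appearance, geometry, and motion" more concrete, here is a minimal sketch of what one such primitive could look like as a data structure. It is an illustration only, not the paper's representation: the class name `DeformableGaussian`, its fields, and the choice to model motion as a per-time-step offset added to a canonical center are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]


@dataclass
class DeformableGaussian:
    """Hypothetical time-varying 3D Gaussian primitive (illustrative only)."""
    mean: Vec3           # canonical center (geometry)
    scale: Vec3          # per-axis extent (geometry)
    rotation: Quat       # unit quaternion (w, x, y, z) (geometry)
    opacity: float       # blending weight (appearance)
    color: Vec3          # RGB color (appearance)
    offsets: List[Vec3]  # assumed motion model: one displacement per time step

    def center_at(self, t: int) -> Vec3:
        """Return the center at time step t: canonical mean plus offset."""
        x, y, z = self.mean
        dx, dy, dz = self.offsets[t]
        return (x + dx, y + dy, z + dz)


# A single Gaussian that shifts 0.5 units along x between two time steps.
g = DeformableGaussian(
    mean=(0.0, 0.0, 1.0),
    scale=(0.1, 0.1, 0.1),
    rotation=(1.0, 0.0, 0.0, 0.0),
    opacity=0.9,
    color=(0.8, 0.2, 0.2),
    offsets=[(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)],
)
print(g.center_at(1))  # (0.5, 0.0, 1.0)
```

In this sketch a scene would be a list of such primitives, and rendering a frame at time `t` from a given camera would amount to splatting every Gaussian at `center_at(t)`. A feed-forward model in the sense of the abstract would output all of these fields directly, rather than fitting them per scene.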
