Controlling Space and Time with Diffusion Models
Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasacchi, David J. Fleet

TL;DR
This paper introduces 4DiM, a novel diffusion-based model for 4D view synthesis that supports arbitrary camera trajectories and timestamps, improving generalization and enabling dynamic scene generation from limited data.
Contribution
The paper presents the first NVS method with intuitive metric-scale camera control and a new architecture that trains on mixed 3D, 4D, and video data for better generalization.
Findings
Outperforms prior 3D NVS models in fidelity and pose accuracy
Enables scene dynamics and pose-controlled video generation
Supports various tasks like single-image 3D and video translation
Abstract
We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS), supporting generation with arbitrary camera trajectories and timestamps, in natural scenes, conditioned on one or more images. With a novel architecture and sampling procedure, we enable training on a mixture of 3D (with camera pose), 4D (pose+time) and video (time but no pose) data, which greatly improves generalization to unseen images and camera pose trajectories over prior works that focus on limited domains (e.g., object centric). 4DiM is the first-ever NVS method with intuitive metric-scale camera pose control enabled by our novel calibration pipeline for structure-from-motion-posed data. Experiments demonstrate that 4DiM outperforms prior 3D NVS models both in terms of image fidelity and pose alignment, while also enabling the generation of scene dynamics. 4DiM provides a general framework for a…
Peer Reviews
Decision: ICLR 2025 Poster
The task of unposed 4D novel view synthesis is a challenging one, and the paper presents a novel approach to tackle it. Compared to previous works, the proposed model has two strengths:
- The sampling method is well designed. The multi-guidance sampling enables control over images, camera poses, and timestamps.
- The dataset combination enhances the generalization ability of the model. The authors also give experimental results to explain the effectiveness of the dataset.
- The authors propose no
Although the paper has several strengths, there are some weaknesses that need to be addressed:
- The sampling resolution is not high enough. Compared with SOTA methods, ViewCrafter generates videos with higher resolution (576x1024), while the proposed method only generates 256x256 resolution videos even with a super resolution model. Is it possible to generate higher resolution videos with shorter sequence length or smaller training batch size?
- The discussion of 360-degree videos is not enough
S1: General model. 4DiM enables joint training with posed and unposed video data, which not only improves the generalization and output fidelity, but also allows joint camera pose and time control in 4D novel view synthesis. It is shown to perform well on various tasks such as image-to-3D, two-image-to-video (interpolation and extrapolation), and pose-conditioned video-to-video translation.
S2: Better 3D consistency and pose alignment. Both qualitative and quantitative results show a consistent i
W1: Motion realism and temporal consistency. While the model shows great 3D consistency and pose alignment, the fidelity and temporal consistency of dynamic objects seem less impressive. In most dynamic scene results, the moving objects (cars, tires, animals, etc.) either have unrealistic motion or temporal artifacts. This can probably be improved by including dynamic object data like Objaverse for training. It would also be good to show more results on single-image-to-4D.
W2: Cartoonish texture
- They introduce masked FiLM layers, which avoid misusing 0 conditions (1) during conditioning signal dropout and (2) due to missing data.
- They introduce the cRealEstate10K dataset to unify the scale of camera extrinsics, which reduces ambiguities during training and leads to more reliable camera control.
- The qualitative results are promising, showing good disentanglement between camera and time.
- The quantitative results are better than those of existing methods on RE10k/LLFF.
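The masked FiLM point above can be pictured with a minimal Python sketch. It assumes the common FiLM form, features * (1 + scale) + shift, with scale and shift produced by a linear map of the conditioning vector, and it assumes that masking a missing signal reduces the layer to the identity. The function names, arguments, and weights are hypothetical illustrations and are not taken from the paper.

```python
from typing import List, Sequence


def linear(x: Sequence[float], weight: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Plain affine map: one output per (weight row, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def masked_film(features: Sequence[float], cond: Sequence[float], present: bool,
                w_scale, b_scale, w_shift, b_shift) -> List[float]:
    """FiLM modulation with a presence mask (illustrative assumption).

    When `present` is False the scale and shift are multiplied by 0, so the
    layer returns `features` unchanged instead of treating a placeholder
    conditioning value (e.g. 0) as if it were a real signal.
    """
    scale = linear(cond, w_scale, b_scale)
    shift = linear(cond, w_shift, b_shift)
    m = 1.0 if present else 0.0
    return [f * (1.0 + m * s) + m * t
            for f, s, t in zip(features, scale, shift)]


# Hypothetical toy weights: a 1-D conditioning signal (say, a timestamp)
# modulating two feature channels.
W_SCALE, B_SCALE = [[0.5], [0.5]], [0.1, 0.1]
W_SHIFT, B_SHIFT = [[1.0], [1.0]], [0.2, 0.2]

feats = [1.0, 2.0]
# A genuine timestamp of 0 still modulates the features through the biases.
real_zero = masked_film(feats, [0.0], True, W_SCALE, B_SCALE, W_SHIFT, B_SHIFT)
# A missing timestamp (placeholder 0, mask off) leaves the features untouched.
missing = masked_film(feats, [0.0], False, W_SCALE, B_SCALE, W_SHIFT, B_SHIFT)
```

In this sketch a real conditioning value of 0 and a missing value produce different outputs, which is the ambiguity the reviewer says the masked layers avoid.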
- It's unclear whether the problem should be called 4D NVS. One common interpretation of 4D is 3D+time. Under this interpretation, I don't think the proposed model can do 4D. With camera control, it can do 3D. With time control, it can do 2D+time. However, combining both is not equivalent to 3D+time. 3D+time would require generating multiple videos that are synchronized in time, or multiple "frozen" time videos with different t.
- Data: The model is trained on mostly static data. The only source
Taxonomy
Topics: Matrix Theory and Algorithms
Methods: Focus · Sparse Evolutionary Training · Diffusion
