Controlling Space and Time with Diffusion Models

Daniel Watson; Saurabh Saxena; Lala Li; Andrea Tagliasacchi; David J. Fleet

arXiv:2407.07860·cs.CV·April 22, 2025

Open Access · 1 Video · 3 Reviews

TL;DR

This paper introduces 4DiM, a novel diffusion-based model for 4D view synthesis that supports arbitrary camera trajectories and timestamps, improving generalization and enabling dynamic scene generation from limited data.

Contribution

The paper presents the first NVS method with intuitive metric-scale camera control and a new architecture that trains on mixed 3D, 4D, and video data for better generalization.

Findings

01

Outperforms prior 3D NVS models in fidelity and pose accuracy

02

Enables scene dynamics and pose-controlled video generation

03

Supports various tasks like single-image 3D and video translation

Abstract

We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS), supporting generation with arbitrary camera trajectories and timestamps, in natural scenes, conditioned on one or more images. With a novel architecture and sampling procedure, we enable training on a mixture of 3D (with camera pose), 4D (pose+time) and video (time but no pose) data, which greatly improves generalization to unseen images and camera pose trajectories over prior works that focus on limited domains (e.g., object centric). 4DiM is the first-ever NVS method with intuitive metric-scale camera pose control enabled by our novel calibration pipeline for structure-from-motion-posed data. Experiments demonstrate that 4DiM outperforms prior 3D NVS models both in terms of image fidelity and pose alignment, while also enabling the generation of scene dynamics. 4DiM provides a general framework for a…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The task of unposed 4D novel view synthesis is a challenging one, and the paper presents a novel approach to tackle it. Compared to previous works, the proposed model has two strengths:
- The sampling method is well designed. The multi-guidance sampling enables control over images, camera poses, and timestamps.
- The dataset combination enhances the generalization ability of the model. The authors also give experimental results to explain the effectiveness of the dataset.
- The authors propose no
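The multi-guidance sampling praised in this review can be pictured with a generic sketch. It assumes the common compositional form of classifier-free guidance, where each conditioning signal (image, camera pose, timestamp) contributes its own weighted correction to an unconditional prediction. The function name `multi_guidance`, its arguments, and this exact weighting are illustrative assumptions, not the formula used by 4DiM.

```python
import numpy as np

def multi_guidance(eps_uncond, eps_conds, weights):
    """Combine noise predictions with one guidance weight per signal.

    Hypothetical sketch (not the 4DiM formula):
        eps = eps_uncond + sum_i w_i * (eps_cond_i - eps_uncond)

    eps_uncond: prediction with every conditioning signal dropped
    eps_conds:  list of predictions, each with one signal enabled
    weights:    list of guidance weights, one per signal
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    out = eps_uncond.copy()
    for eps_c, w in zip(eps_conds, weights):
        # Push the estimate along the direction this signal suggests.
        out = out + w * (np.asarray(eps_c, dtype=float) - eps_uncond)
    return out
```

With every weight set to zero the result is the unconditional prediction. Raising a single weight strengthens adherence to that signal alone, which is what lets the conditioning signals be controlled separately.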

Weaknesses

Although the paper has several strengths, there are some weaknesses that need to be addressed:
- The sampling resolution is not high enough. Among SOTA methods, ViewCrafter generates videos at a higher resolution (576x1024), while the proposed method only generates 256x256 videos even with a super-resolution model. Is it possible to generate higher-resolution videos with a shorter sequence length or a smaller training batch size?
- The discussion of 360-degree videos is not enough

Reviewer 02 · Rating 8 · Confidence 5

Strengths

S1: General model. 4DiM enables joint training with posed and unposed video data, which not only improves generalization and output fidelity, but also allows joint camera pose and time control in 4D novel view synthesis. It is shown to perform well on various tasks such as image-to-3D, two-image-to-video (interpolation and extrapolation), and pose-conditioned video-to-video translation.
S2: Better 3D consistency and pose alignment. Both qualitative and quantitative results show a consistent i

Weaknesses

W1: Motion realism and temporal consistency. While the model shows great 3D consistency and pose alignment, the fidelity and temporal consistency of dynamic objects seem less impressive. In most dynamic scene results, the moving objects (cars, tires, animals, etc.) either have unrealistic motion or temporal artifacts. This can probably be improved by including dynamic object data like Objaverse for training. It would also be good to show more results on single-image-to-4D.
W2: Cartoonish texture

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- They introduce masked FiLM layers, which avoid misusing 0 conditions (1) during conditioning signal dropout and (2) due to missing data.
- They introduce the cRealEstate10K dataset to unify the scale of camera extrinsics, which reduces ambiguities during training and leads to more reliable camera control.
- The qualitative results are promising, showing good disentanglement between camera and time.
- The quantitative results are better than those of existing methods on RE10k/LLFF.
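The masked FiLM idea in the first point above can be sketched as follows. This is one plausible reading, not the paper's implementation: a FiLM layer scales and shifts features using a conditioning embedding, and a per-example mask switches the modulation off, so the layer acts as the identity when a signal (for example pose or timestamp) is missing or dropped, instead of treating zeros as a real condition. The name `masked_film`, the linear projections `w_scale` and `w_shift`, and the `(1 + scale)` parameterization are assumptions made for illustration.

```python
import numpy as np

def masked_film(x, cond, mask, w_scale, w_shift):
    """FiLM modulation that reduces to the identity where mask == 0.

    x:       (batch, channels) feature vectors
    cond:    (batch, cond_dim) conditioning embedding
    mask:    (batch,) 1.0 if the signal is present, 0.0 if missing/dropped
    w_scale: (cond_dim, channels) hypothetical projection for the scale
    w_shift: (cond_dim, channels) hypothetical projection for the shift
    """
    scale = cond @ w_scale
    shift = cond @ w_shift
    m = mask[:, None]  # broadcast the mask over channels
    # Where m == 0 this becomes (1 + 0) * x + 0, i.e. x is left unchanged.
    return (1.0 + m * scale) * x + m * shift
```

Under this reading, a video clip without camera poses would pass `mask = 0` for the pose branch, so whatever placeholder values sit in `cond` never reach the features.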

Weaknesses

- It's unclear whether the problem should be called 4D NVS. One common interpretation of 4D is 3D+time. Under this interpretation, I don't think the proposed model can do 4D. With camera control, it can do 3D. With time control, it can do 2D+time. However, combining both is not equivalent to 3D+time. 3D+time would require generating multiple videos that are synchronized in time, or multiple "frozen" time videos with different t.
- Data: The model is trained on mostly static data. The only source

Videos

Controlling Space and Time with Diffusion Models · SlidesLive

Taxonomy

Topics: Matrix Theory and Algorithms

Methods: Focus · Sparse Evolutionary Training · Diffusion