VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

Sherwin Bahmani; Ivan Skorokhodov; Aliaksandr Siarohin; Willi Menapace; Guocheng Qian; Michael Vasilkovsky; Hsin-Ying Lee; Chaoyang Wang; Jiaxu Zou; Andrea Tagliasacchi; David B. Lindell; Sergey Tulyakov

arXiv:2407.12781 · cs.CV · March 25, 2025

TL;DR

This paper introduces a novel method for controlling camera movement in transformer-based video diffusion models, enabling fine-grained 3D camera control in generated videos, which was not possible with previous models.

Contribution

We propose a ControlNet-like conditioning mechanism with spatiotemporal camera embeddings for transformer-based video diffusion models, enabling controllable video generation.

Findings

01. Achieves state-of-the-art performance in controllable video generation.

02. First method to enable camera control in transformer-based video diffusion models.

03. Demonstrates effective fine-tuning on the RealEstate10K dataset.

Abstract

Modern text-to-video synthesis models demonstrate coherent, photorealistic generation of complex videos from a text description. However, most existing models lack fine-grained control over camera movement, which is critical for downstream applications related to content creation, visual effects, and 3D vision. Recently, new methods have demonstrated the ability to generate videos with controllable camera poses; these techniques leverage pre-trained U-Net-based diffusion models that explicitly disentangle spatial and temporal generation. Still, no existing approach enables camera control for new, transformer-based video diffusion models that process spatial and temporal information jointly. Here, we propose to tame video transformers for 3D camera control using a ControlNet-like conditioning mechanism that incorporates spatiotemporal camera embeddings based on Plücker coordinates. The…
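To make the Plücker-coordinate camera embedding mentioned in the abstract concrete, here is a minimal sketch of how a per-pixel Plücker ray can be computed from a pinhole camera. It is an illustration under stated assumptions, not the authors' implementation: the function names `plucker_ray` and `plucker_embedding`, the pinhole intrinsics `fx, fy, cx, cy`, the camera-to-world rotation `R_c2w`, and the (moment, direction) ordering are all hypothetical choices that this page does not specify.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def mat_vec(m, v):
    """Multiply a 3x3 matrix (nested tuples) by a 3-vector."""
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))

def plucker_ray(u, v, fx, fy, cx, cy, R_c2w, cam_center):
    """Return the 6D Plücker coordinates (o x d, d) of the ray through pixel (u, v).

    Assumptions (not taken from the paper): pinhole intrinsics, R_c2w is a
    camera-to-world rotation, cam_center is the camera origin o in world space.
    """
    # Back-project the pixel: direction in the camera frame is K^-1 [u, v, 1].
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate into the world frame and normalize to unit length.
    d_world = mat_vec(R_c2w, d_cam)
    norm = math.sqrt(sum(c * c for c in d_world))
    d = tuple(c / norm for c in d_world)
    # The moment o x d encodes where the ray sits in space, not just its direction.
    m = cross(cam_center, d)
    return m + d

def plucker_embedding(height, width, fx, fy, cx, cy, R_c2w, cam_center):
    """Build a height x width grid of 6D Plücker rays, sampled at pixel centers."""
    return [[plucker_ray(x + 0.5, y + 0.5, fx, fy, cx, cy, R_c2w, cam_center)
             for x in range(width)]
            for y in range(height)]

# Example: one frame of a tiny 2x3 "image" seen from a camera at (1, 0, 0).
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
frame = plucker_embedding(2, 3, 100.0, 100.0, 1.5, 1.0, IDENTITY, (1.0, 0.0, 0.0))
```

Repeating this for every frame of a camera trajectory gives a tensor with one 6D vector per pixel per frame, which is the kind of spatiotemporal signal a conditioning branch could consume. A useful sanity check is that the moment is always orthogonal to the direction.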

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

Camera control during the video generation process is a significant issue. As more foundational models adopt transformer architectures, exploring control mechanisms for these models becomes crucial. This paper is the first to investigate how to better utilize camera trajectory parameters for transformer-based video generation models, using SnapVideo as the foundational model. The design is well thought out, and the evaluation is rigorous. The strengths of the paper are as follows:
- Unlike the

Weaknesses

- While camera control is the central problem addressed in this paper, the camera trajectories used primarily come from the RealEstate10K dataset, which, as observed in the visual results, mostly follow smooth, straight lines. There is a lack of consideration and experimentation with trajectories of varying difficulty, such as those involving significant directional changes. This raises some questions regarding the trajectory settings.
- There have been several prior works in the 3D multi-vie

Reviewer 02 · Rating 3 · Confidence 5

Strengths

- The proposed ControlNet design outperforms the other model variants designed by the authors. The design choices are evaluated thoroughly, and detailed ablations are provided.

Weaknesses

- The proposed framework overfits to the trajectories seen during training. Though the authors provide quantitative comparisons in Tab. 8, no visual comparisons are provided.
- Though the performance is impressive, the technical contribution of the proposed framework is limited. Training a ControlNet for a diffusion transformer is not new, as shown in [1]. Using Plücker coordinates for camera control is not new, as shown in CameraCtrl (He et al., 2024a).

[1] Chen J, Wu Y, Luo S, et al.

Reviewer 03 · Rating 6 · Confidence 5

Strengths

- paper is easy to follow
- the proposed design including the Plücker embedding is reasonable and effective
- comprehensive experiments are conducted and presented in the main manuscript and appendix
- supplemental materials contain video samples to demonstrate the effectiveness

Weaknesses

- The proposed method has been evaluated only on one video diffusion transformer, which raises some concerns on whether its performance can generalize to other pretrained video diffusion transformers.
- I'm curious about the distribution of camera movements evaluated in the experiments, in terms of its diversity and similarity to natural camera movements.
- The novelty is slightly limited, as the task is not new, and ControlNet-like module as well as Plücker embedding have been explored and used

Videos

VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control · SlidesLive

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Advanced Optical Imaging Technologies · Advanced Optical Sensing Technologies

Methods: Diffusion