GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking

Weikang Bian; Zhaoyang Huang; Xiaoyu Shi; Yijin Li; Fu-Yun Wang; Hongsheng Li

arXiv:2501.02690 · cs.CV · January 7, 2025

TL;DR

GS-DiT introduces a novel pseudo 4D Gaussian field framework with dense 3D point tracking to enable controllable, multi-view video generation, significantly improving efficiency and flexibility over existing methods.

Contribution

The paper proposes a new pseudo 4D Gaussian field approach with an efficient dense 3D point tracking method, enhancing controllability and speed in video generation.

Findings

01 Outperforms state-of-the-art sparse 3D point tracking in accuracy and speed

02 Enables multi-view video generation with consistent dynamic content

03 Supports advanced cinematic effects through Gaussian field manipulation

Abstract

4D video control is essential in video generation as it enables the use of sophisticated lens techniques, such as multi-camera shooting and dolly zoom, which are currently unsupported by existing methods. Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view videos. Inspired by Monocular Dynamic novel View Synthesis (MDVS) that optimizes a 4D representation and renders videos according to different 4D elements, such as camera pose and object motion editing, we bring pseudo 4D Gaussian fields to video generation. Specifically, we propose a novel framework that constructs a pseudo 4D Gaussian field with dense 3D point tracking and renders the Gaussian field for all video frames. Then we finetune a pretrained DiT to generate videos following the guidance of the rendered video, dubbed as GS-DiT. To boost the training of the GS-DiT, we also…
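The dolly zoom mentioned in the abstract can be illustrated with a small geometric sketch. This is not the GS-DiT pipeline: it assumes the dense 3D tracks are already available as per-frame point lists in camera space, it projects bare points with a pinhole model instead of rendering Gaussians, and every name in it (`project`, `dolly_zoom_focal`, `render_dolly_zoom`) is hypothetical. It only shows why a tracked 3D representation makes this kind of lens effect a matter of changing camera parameters per frame.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project(point: Point3D, focal: float, cam_z: float = 0.0) -> Point2D:
    """Pinhole projection for a camera shifted by cam_z along the optical axis."""
    x, y, z = point
    depth = z - cam_z
    if depth <= 0:
        raise ValueError("point is behind the camera")
    return (focal * x / depth, focal * y / depth)


def dolly_zoom_focal(focal: float, subject_depth: float, cam_z: float) -> float:
    """Focal length that keeps a subject at subject_depth the same image size
    after the camera has moved to cam_z (the classic dolly zoom relation)."""
    new_depth = subject_depth - cam_z
    if new_depth <= 0:
        raise ValueError("camera has moved past the subject")
    return focal * new_depth / subject_depth


def render_dolly_zoom(
    tracks: Sequence[Sequence[Point3D]],
    focal: float,
    subject_depth: float,
    cam_z_per_frame: Sequence[float],
) -> List[List[Point2D]]:
    """Project each frame's tracked points with a per-frame camera offset
    and the compensating focal length."""
    frames = []
    for points, cam_z in zip(tracks, cam_z_per_frame):
        f = dolly_zoom_focal(focal, subject_depth, cam_z)
        frames.append([project(p, f, cam_z) for p in points])
    return frames


if __name__ == "__main__":
    # Assumed toy data: one subject point at depth 4 and one background point at depth 8.
    frame_points = [(1.0, 0.0, 4.0), (1.0, 0.0, 8.0)]
    frames = render_dolly_zoom(
        tracks=[frame_points, frame_points],
        focal=100.0,
        subject_depth=4.0,
        cam_z_per_frame=[0.0, -4.0],  # camera pulls back 4 units in the second frame
    )
    for i, pts in enumerate(frames):
        print(i, pts)
```

In this toy setup the subject point keeps its image position across frames while the background point moves outward, which is the visual signature of a dolly zoom.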

Taxonomy

Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · 3D Shape Modeling and Analysis

Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Absolute Position Encodings · Dropout · Linear Layer · Softmax · Adam · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Residual Connection