DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation

Wei Wu; Xi Guo; Weixuan Tang; Tingxuan Huang; Chiyu Wang; Dongyue Chen; Chenjing Ding

arXiv:2409.05463 · cs.CV · September 13, 2024

TL;DR

DriveScape is a framework that generates high-resolution (1024 x 576), multi-view driving videos at 10Hz under 3D conditions. It targets two limits of earlier methods, weak spatial-temporal consistency and a 2Hz frame rate, and reports state-of-the-art results.

Contribution

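The paper introduces DriveScape, an end-to-end, 3D condition-guided framework for high-resolution, multi-view driving video generation. Its Bi-Directional Modulated Transformer (BiMot) aligns 3D structural information to keep the generated views spatially and temporally consistent.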

Findings

01. Achieves 1024 x 576 resolution at 10Hz.

02. Outperforms existing methods with an FID of 8.34.

03. Maintains spatial-temporal consistency in multi-view videos.

Abstract

Recent advancements in generative models have provided promising solutions for synthesizing realistic driving videos, which are crucial for training autonomous driving perception models. However, existing approaches often struggle with multi-view video generation due to the challenges of integrating 3D information while maintaining spatial-temporal consistency and effectively learning from a unified model. We propose DriveScape, an end-to-end framework for multi-view, 3D condition-guided video generation, capable of producing 1024 x 576 high-resolution videos at 10Hz. Unlike other methods limited to 2Hz due to the 3D box annotation frame rate, DriveScape overcomes this with its ability to operate under sparse conditions. Our Bi-Directional Modulated Transformer (BiMot) ensures precise alignment of 3D structural information, maintaining spatial-temporal consistency. DriveScape excels in…
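The frame-rate gap mentioned in the abstract can be made concrete with a little arithmetic. The sketch below is only an illustration of what "sparse conditions" implies when 3D box annotations arrive at 2Hz but the video runs at 10Hz: only every fifth frame carries a box annotation. The function name `annotated_frame_mask` and its parameters are hypothetical, and nothing here describes how DriveScape itself consumes the conditions, which the abstract does not specify.

```python
def annotated_frame_mask(num_frames: int, video_hz: int = 10, annotation_hz: int = 2) -> list[bool]:
    """Mark which video frames coincide with a 3D box annotation.

    Hypothetical helper for illustration: assumes annotations are evenly
    spaced and that frame 0 is annotated.
    """
    if video_hz % annotation_hz != 0:
        raise ValueError("video_hz must be an integer multiple of annotation_hz")
    stride = video_hz // annotation_hz  # 10Hz video / 2Hz boxes -> every 5th frame
    return [i % stride == 0 for i in range(num_frames)]


# One second of 10Hz video: frames 0 and 5 are annotated, the other 8 are not.
mask = annotated_frame_mask(10)
annotated_indices = [i for i, has_box in enumerate(mask) if has_box]

# Pixels per frame at the reported resolution.
pixels_per_frame = 1024 * 576
```

Under these assumptions, four out of every five frames have no 3D box annotation, which is why a method tied to the annotation rate would be limited to 2Hz output.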

Taxonomy

Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques · Video Coding and Compression Technologies

Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Linear Layer · Adam