DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation
Wei Wu, Xi Guo, Weixuan Tang, Tingxuan Huang, Chiyu Wang, Dongyue Chen, Chenjing Ding

TL;DR
DriveScape is a framework that generates high-resolution, multi-view driving videos under 3D condition guidance. It addresses earlier limits on spatial-temporal consistency and frame rate, and reports state-of-the-art results.
Contribution
It introduces DriveScape, a novel end-to-end 3D-guided video generation model with a Bi-Directional Modulated Transformer for high-resolution, multi-view driving videos.
Findings
Achieves 1024x576 resolution at 10Hz.
Outperforms existing methods with an FID of 8.34.
Maintains spatial-temporal consistency in multi-view videos.
Abstract
Recent advancements in generative models have provided promising solutions for synthesizing realistic driving videos, which are crucial for training autonomous driving perception models. However, existing approaches often struggle with multi-view video generation due to the challenges of integrating 3D information while maintaining spatial-temporal consistency and effectively learning from a unified model. We propose DriveScape, an end-to-end framework for multi-view, 3D condition-guided video generation, capable of producing 1024 x 576 high-resolution videos at 10Hz. Unlike other methods limited to 2Hz due to the 3D box annotation frame rate, DriveScape overcomes this with its ability to operate under sparse conditions. Our Bi-Directional Modulated Transformer (BiMot) ensures precise alignment of 3D structural information, maintaining spatial-temporal consistency. DriveScape excels in…
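To make the "sparse conditions" point concrete: if 3D box annotations exist at 2Hz and the video is generated at 10Hz, only one frame in five has a box condition. The sketch below is illustrative only and is not the paper's method. The helper name `keyframe_mask` and the assumption that annotated frames are evenly spaced and start at frame 0 are hypothetical.

```python
def keyframe_mask(num_frames, video_hz=10, annot_hz=2):
    """Mark which video frames carry a 3D box annotation.

    Assumes (hypothetically) that annotations are evenly spaced and that
    frame 0 is annotated. Returns one boolean per frame.
    """
    if video_hz % annot_hz != 0:
        raise ValueError("video_hz must be a multiple of annot_hz")
    stride = video_hz // annot_hz  # 10Hz video / 2Hz boxes -> every 5th frame
    return [i % stride == 0 for i in range(num_frames)]


# One second of 10Hz video: only frames 0 and 5 have box conditions.
mask = keyframe_mask(10)
annotated = [i for i, has_box in enumerate(mask) if has_box]
```

Under these assumptions, a generator running at 10Hz has to produce the four frames between each annotated pair without direct box guidance.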
Taxonomy
Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques · Video Coding and Compression Technologies
Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Linear Layer · Adam
