Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention
Hannan Lu, Xiaohe Wu, Shudong Wang, Xiameng Qin, Xinyu Zhang, Junyu Han, Wangmeng Zuo, Ji Tao

TL;DR
CogDriving is a novel diffusion transformer-based network that generates high-quality multi-view driving videos by capturing holistic spatio-temporal-view associations, with a lightweight controller and dynamic object learning for improved realism.
Contribution
The paper introduces CogDriving, a diffusion transformer architecture with holistic 4D attention and a lightweight controller, advancing multi-view driving video synthesis with better consistency and control.
Findings
Achieves an FVD score of 37.8 on the nuScenes validation set.
Outperforms existing methods in multi-view video generation quality.
Demonstrates effective control over Bird's-Eye-View layouts.
Abstract
Generating multi-view videos for autonomous driving training has recently gained much attention, with the challenge of addressing both cross-view and cross-frame consistency. Existing methods typically apply decoupled attention mechanisms for spatial, temporal, and view dimensions. However, these approaches often struggle to maintain consistency across dimensions, particularly when handling fast-moving objects that appear at different times and viewpoints. In this paper, we present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions. We also propose a lightweight controller tailored for CogDriving, i.e., Micro-Controller, which uses only 1.1% of the parameters of…
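To make the contrast between decoupled and holistic attention concrete, here is a minimal NumPy sketch. It is not the CogDriving implementation: the function names, the tensor layout `(views, frames, spatial tokens, channels)`, and the use of identity query/key/value projections are all assumptions chosen for illustration. The decoupled variant attends along one axis at a time, while the holistic variant flattens all three axes into one token sequence so that any token can attend directly to any other view, frame, and location.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last axis.
    # q, k, v have shape (..., n_tokens, channels).
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def decoupled_attention(x):
    # x: (V, T, S, D). Hypothetical baseline: spatial, then temporal,
    # then cross-view attention, each restricted to a single axis.
    x = attention(x, x, x)                 # tokens = S (within one view and frame)
    xt = np.swapaxes(x, 1, 2)              # (V, S, T, D)
    x = np.swapaxes(attention(xt, xt, xt), 1, 2)   # tokens = T
    xv = np.moveaxis(x, 0, 2)              # (T, S, V, D)
    x = np.moveaxis(attention(xv, xv, xv), 2, 0)   # tokens = V
    return x

def holistic_attention(x):
    # x: (V, T, S, D). Flatten views, frames, and spatial positions into
    # one sequence so every token attends to every other token jointly.
    V, T, S, D = x.shape
    tokens = x.reshape(V * T * S, D)
    return attention(tokens, tokens, tokens).reshape(V, T, S, D)

# Toy input: 2 views, 3 frames, 4 spatial tokens, 8 channels (assumed sizes).
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 8))
out_decoupled = decoupled_attention(x)
out_holistic = holistic_attention(x)
```

In the decoupled sketch, information from a token in another view at another time reaches a given token only indirectly, after passing through several single-axis steps. The holistic sketch relates the two in a single attention step, at the cost of an attention matrix whose side length is `V * T * S`.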
Taxonomy
Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques · Generative Adversarial Networks and Image Synthesis
Methods: Absolute Position Encodings · Residual Connection · Adam · Attention Is All You Need · Softmax · Label Smoothing · Dropout · Dense Connections · Layer Normalization · Diffusion
