Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention

Hannan Lu; Xiaohe Wu; Shudong Wang; Xiameng Qin; Xinyu Zhang; Junyu Han; Wangmeng Zuo; Ji Tao

arXiv:2412.03520 · cs.CV · December 10, 2024

Open Access

TL;DR

CogDriving is a novel diffusion transformer-based network that generates high-quality multi-view driving videos by capturing holistic spatio-temporal-view associations, with a lightweight controller and dynamic object learning for improved realism.

Contribution

The paper introduces CogDriving, a diffusion transformer architecture with holistic 4D attention and a lightweight controller, advancing multi-view driving video synthesis with better consistency and control.

Findings

1. Achieves an FVD score of 37.8 on the nuScenes validation set.

2. Outperforms existing methods in multi-view video generation quality.

3. Demonstrates effective control over Bird's-Eye-View layouts.

Abstract

Generating multi-view videos for autonomous driving training has recently gained much attention, with the challenge of addressing both cross-view and cross-frame consistency. Existing methods typically apply decoupled attention mechanisms for spatial, temporal, and view dimensions. However, these approaches often struggle to maintain consistency across dimensions, particularly when handling fast-moving objects that appear at different times and viewpoints. In this paper, we present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions. We also propose a lightweight controller tailored for CogDriving, i.e., Micro-Controller, which uses only 1.1% of the parameters of…
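The contrast the abstract draws between decoupled attention and holistic 4D attention can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names, the tensor layout `(V, T, S, D)` (views, frames, spatial tokens, channels), and the use of identity query/key/value projections are all assumptions made for illustration, and positional encodings and conditioning are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention; tokens live on the second-to-last axis,
    # all leading axes are treated as batch dimensions.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def holistic_attention(x):
    # Hypothetical sketch: flatten views, frames, and spatial positions into
    # one sequence so every token attends to every other token in one step.
    V, T, S, D = x.shape
    tokens = x.reshape(V * T * S, D)
    return attention(tokens, tokens, tokens).reshape(V, T, S, D)

def decoupled_attention(x):
    # Hypothetical sketch of the decoupled alternative: attend along one
    # dimension at a time (space, then time, then view).
    x = attention(x, x, x)            # spatial: tokens within one (view, frame)
    xt = np.swapaxes(x, 1, 2)         # (V, S, T, D): frames become the tokens
    x = np.swapaxes(attention(xt, xt, xt), 1, 2)
    xv = np.moveaxis(x, 0, 2)         # (T, S, V, D): views become the tokens
    return np.moveaxis(attention(xv, xv, xv), 2, 0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(2, 3, 4, 8))  # toy sizes: 2 views, 3 frames, 4 tokens
    print(holistic_attention(x).shape, decoupled_attention(x).shape)
```

In the holistic variant a single attention matrix of size (V·T·S) × (V·T·S) relates a token directly to tokens at other views, frames, and positions. In the decoupled variant each stage only relates tokens that share the other two indices, so cross-dimension relations have to be composed across stages.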

Taxonomy

Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques · Generative Adversarial Networks and Image Synthesis

Methods: Absolute Position Encodings · Residual Connection · Adam · Attention Is All You Need · Softmax · Label Smoothing · Dropout · Dense Connections · Layer Normalization · Diffusion