DiVE: DiT-based Video Generation with Enhanced Control

Junpeng Jiang; Gangyi Hong; Lijun Zhou; Enhui Ma; Hengtong Hu; Xia Zhou; Jie Xiang; Fan Liu; Kaicheng Yu; Haiyang Sun; Kun Zhan; Peng Jia; Miao Zhang

arXiv:2409.01595 · cs.CV · September 4, 2024

Open Access

TL;DR

This paper introduces DiVE, a novel DiT-based framework for generating high-quality, temporally and multi-view consistent videos in autonomous driving scenarios, with precise control matching bird's-eye view layouts.

Contribution

It is the first DiT-based method specifically designed for multi-view, controllable video generation with cross-view consistency mechanisms.

Findings

01. Effective in producing long, controllable videos in challenging corner cases.

02. Outperforms existing methods in qualitative assessments on the nuScenes dataset.

03. Ensures cross-view consistency through a novel spatial view-inflated attention mechanism.

Abstract

Generating high-fidelity, temporally consistent videos in autonomous driving scenarios remains a significant challenge, e.g. for problematic maneuvers in corner cases. Although recent video generation works, such as models built on Diffusion Transformers (DiT), have been proposed to tackle this problem, work that explores their potential for multi-view video generation is still missing. Notably, we propose the first DiT-based framework specifically designed for generating temporally and multi-view consistent videos that precisely match the given bird's-eye view layout controls. Specifically, the proposed framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee cross-view consistency, while joint cross-attention modules and a ControlNet-Transformer are integrated to further improve the precision of control. To…
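The abstract does not spell out how the spatial view-inflated attention works. The following is a minimal NumPy sketch of one plausible reading, assuming that "view-inflated" means the tokens of all camera views are flattened into a single sequence so that an ordinary self-attention layer attends across views while reusing its existing weights (hence no new parameters). The function names, the single-head formulation, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the token axis (-2)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def view_inflated_attention(x, w_q, w_k, w_v):
    """Hypothetical view-inflated attention.

    x has shape (batch, views, tokens, channels). The view axis is merged
    into the token axis, so every token attends to tokens from all views.
    Only reshapes are added; the projection weights are the same ones a
    per-view attention layer would use.
    """
    b, v, n, c = x.shape
    inflated = x.reshape(b, v * n, c)           # (batch, views * tokens, channels)
    out = self_attention(inflated, w_q, w_k, w_v)
    return out.reshape(b, v, n, -1)             # restore the per-view layout

# Example with assumed sizes: 2 samples, 6 camera views, 16 tokens, 8 channels.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6, 16, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = view_inflated_attention(x, w_q, w_k, w_v)
print(out.shape)
```

Under this reading, the attention matrix grows from tokens × tokens per view to (views · tokens) × (views · tokens), which trades extra compute and memory for cross-view interaction without any additional learned weights.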

Taxonomy

Topics: Advanced Vision and Imaging

Methods: Softmax · Attention Is All You Need · Diffusion