DiVE: DiT-based Video Generation with Enhanced Control
Junpeng Jiang, Gangyi Hong, Lijun Zhou, Enhui Ma, Hengtong Hu, Xia Zhou, Jie Xiang, Fan Liu, Kaicheng Yu, Haiyang Sun, Kun Zhan, Peng Jia, Miao Zhang

TL;DR
This paper introduces DiVE, a novel DiT-based framework for generating high-quality, temporally and multi-view consistent videos in autonomous driving scenarios, with precise control matching bird's-eye view layouts.
Contribution
It is the first DiT-based method specifically designed for multi-view, controllable video generation with cross-view consistency mechanisms.
Findings
Effective in producing long, controllable videos in challenging corner cases.
Outperforms existing methods in qualitative assessments on the nuScenes dataset.
Ensures cross-view consistency through a novel spatial view-inflated attention mechanism.
Abstract
Generating high-fidelity, temporally consistent videos in autonomous driving scenarios remains a significant challenge, e.g. for problematic maneuvers in corner cases. Although recent video generation work, i.e. models built on top of Diffusion Transformers (DiT), has been proposed to tackle this problem, no work has yet explored its potential for multi-view video generation. We propose the first DiT-based framework specifically designed for generating temporally and multi-view consistent videos that precisely match the given bird's-eye view layout controls. Specifically, the proposed framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee cross-view consistency, and integrates joint cross-attention modules and a ControlNet-Transformer to further improve the precision of control. To…
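The abstract does not spell out how the parameter-free spatial view-inflated attention works. One plausible reading, sketched below as an assumption rather than the authors' implementation, is that the tokens of all camera views are folded into a single sequence before spatial self-attention, so the existing projection weights are reused and no new parameters are added. The function names, the single-head formulation, and the tensor layout (batch, views, tokens, dim) are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention over (batch, seq, dim).
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def per_view_attention(x, w_q, w_k, w_v):
    # Baseline: each view attends only to its own tokens.
    b, v, n, c = x.shape
    out = self_attention(x.reshape(b * v, n, c), w_q, w_k, w_v)
    return out.reshape(b, v, n, c)

def view_inflated_attention(x, w_q, w_k, w_v):
    # Assumed "inflation": fold the view axis into the sequence axis so that
    # every token attends to tokens from all views. The same weights are
    # reused, so no parameters are added.
    b, v, n, c = x.shape
    out = self_attention(x.reshape(b, v * n, c), w_q, w_k, w_v)
    return out.reshape(b, v, n, c)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(2, 6, 4, 8))  # 2 samples, 6 camera views, 4 tokens, dim 8
    w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
    print(view_inflated_attention(x, w_q, w_k, w_v).shape)
```

In this sketch the only difference between the two attention variants is the reshape, which is what makes the cross-view variant parameter-free. The cost is a sequence that is longer by a factor equal to the number of views.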
Taxonomy
Topics: Advanced Vision and Imaging
Methods: Softmax · Attention Is All You Need · Diffusion
