In-Context Audio Control of Video Diffusion Transformers
Wenze Liu, Weicai Ye, Minghong Cai, Quande Liu, Xintao Wang, Xiangyu Yue

TL;DR
This paper introduces a novel framework for integrating audio signals into video diffusion transformers, enabling speech-driven video generation with improved lip synchronization and quality through a masked 3D attention mechanism.
Contribution
The paper proposes a method for integrating audio into transformer-based video generation, including a masked 3D attention mechanism that stabilizes training and improves lip synchronization.
Findings
3D attention captures spatio-temporal audio-visual correlations effectively.
Masked 3D attention enables stable training and enhances lip synchronization.
The approach achieves high-quality, speech-driven video generation conditioned on audio and reference images.
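To make the idea of masking a joint audio-video attention concrete, here is a minimal sketch of one possible mask. This summary does not describe the paper's actual masking scheme, so everything below is an assumption for illustration: the function name `build_av_mask`, the token layout (video tokens first, then audio tokens), and the rule that any attention pair involving audio is restricted to temporally nearby frames while video-to-video attention stays fully spatio-temporal.

```python
def build_av_mask(num_frames, video_tokens_per_frame, audio_tokens_per_frame, window=0):
    """Return mask[q][k]; True means query token q may attend to key token k.

    Hypothetical layout (not taken from the paper): all video tokens come
    first, frame by frame, followed by all audio tokens, frame by frame.
    """
    # Frame index of every token in the joint sequence.
    video_frames = [t for t in range(num_frames) for _ in range(video_tokens_per_frame)]
    audio_frames = [t for t in range(num_frames) for _ in range(audio_tokens_per_frame)]
    frames = video_frames + audio_frames
    is_audio = [False] * len(video_frames) + [True] * len(audio_frames)

    n = len(frames)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if not is_audio[q] and not is_audio[k]:
                # Video-to-video: unrestricted spatio-temporal (3D) attention.
                mask[q][k] = True
            else:
                # Any pair involving audio: allowed only within `window` frames
                # (assumed rule, chosen to reflect audio's time-synchronous nature).
                mask[q][k] = abs(frames[q] - frames[k]) <= window
    return mask


if __name__ == "__main__":
    # 3 frames, 2 video tokens and 1 audio token per frame: 9 tokens in total.
    mask = build_av_mask(3, 2, 1, window=0)
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
```

In a real model such a mask would typically be converted to additive biases (zero where allowed, negative infinity where blocked) before the softmax; the `window` parameter here is a hypothetical knob for how far audio influence may spread across frames.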
Abstract
Recent advancements in video generation have seen a shift towards unified, transformer-based foundation models that can handle multiple conditional inputs in-context. However, these models have primarily focused on modalities like text, images, and depth maps, while strictly time-synchronous signals like audio have been underexplored. This paper introduces In-Context Audio Control of video diffusion transformers (ICAC), a framework that investigates the integration of audio signals for speech-driven video generation within a unified full-attention architecture, akin to FullDiT. We systematically explore three distinct mechanisms for injecting audio conditions: standard cross-attention, 2D self-attention, and unified 3D self-attention. Our findings reveal that while 3D attention offers the highest potential for capturing spatio-temporal audio-visual correlations, it presents significant…
Taxonomy
Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Hearing Loss and Rehabilitation
