In-Context Audio Control of Video Diffusion Transformers

Wenze Liu; Weicai Ye; Minghong Cai; Quande Liu; Xintao Wang; Xiangyu Yue

arXiv:2512.18772·cs.CV·December 23, 2025

Open Access

TL;DR

This paper introduces ICAC, a framework for conditioning video diffusion transformers on audio in-context. It enables speech-driven video generation, and its masked 3D attention mechanism stabilizes training and improves lip synchronization.

Contribution

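The paper compares three ways of injecting audio conditions into a transformer-based video generator (standard cross-attention, 2D self-attention, and unified 3D self-attention) and proposes a masked 3D attention mechanism that makes training stable and improves lip synchronization.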

Findings

01. 3D attention captures spatio-temporal audio-visual correlations effectively.

02. Masked 3D attention enables stable training and enhances lip synchronization.

03. The approach achieves high-quality, speech-driven video generation conditioned on audio and reference images.

Abstract

Recent advancements in video generation have seen a shift towards unified, transformer-based foundation models that can handle multiple conditional inputs in-context. However, these models have primarily focused on modalities like text, images, and depth maps, while strictly time-synchronous signals like audio have been underexplored. This paper introduces In-Context Audio Control of video diffusion transformers (ICAC), a framework that investigates the integration of audio signals for speech-driven video generation within a unified full-attention architecture, akin to FullDiT. We systematically explore three distinct mechanisms for injecting audio conditions: standard cross-attention, 2D self-attention, and unified 3D self-attention. Our findings reveal that while 3D attention offers the highest potential for capturing spatio-temporal audio-visual correlations, it presents significant…
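The idea of masking a unified attention over video and audio tokens can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the token layout (video tokens first, then audio tokens, both frame-major), the rule that cross-modal attention is limited to a temporal window, and the names `build_mask`, `masked_attention`, and `window` are all hypothetical. The paper's actual mask may differ.

```python
import math


def build_mask(num_frames, tokens_per_frame, audio_per_frame, window=1):
    """Boolean mask over a joint [video | audio] sequence; True = may attend.

    Assumed layout (hypothetical): video tokens first, then audio tokens,
    both ordered frame by frame. The windowing rule is illustrative only.
    """
    n_video = num_frames * tokens_per_frame
    n_audio = num_frames * audio_per_frame
    # Frame index of every token in the joint sequence.
    frame_of = [i // tokens_per_frame for i in range(n_video)]
    frame_of += [j // audio_per_frame for j in range(n_audio)]
    is_audio = [False] * n_video + [True] * n_audio
    n = n_video + n_audio
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if is_audio[q] == is_audio[k]:
                # Same modality: full attention (video-video, audio-audio).
                mask[q][k] = True
            else:
                # Cross-modal: only tokens from temporally nearby frames.
                mask[q][k] = abs(frame_of[q] - frame_of[k]) <= window
    return mask


def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention that excludes disallowed pairs."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim_out = len(values[0])
    outputs = []
    for q, allowed in zip(queries, mask):
        scores = [
            scale * sum(a * b for a, b in zip(q, k)) if ok else -math.inf
            for k, ok in zip(keys, allowed)
        ]
        peak = max(scores)  # finite: every token may attend to itself
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim_out)
        ])
    return outputs


# Toy setup: 3 frames, 2 video tokens and 1 audio token per frame.
mask = build_mask(num_frames=3, tokens_per_frame=2, audio_per_frame=1, window=0)
queries = [[0.0, 0.0] for _ in range(9)]
keys = [[0.0, 0.0] for _ in range(9)]
values = [[float(i)] for i in range(9)]
outputs = masked_attention(queries, keys, values, mask)
```

With `window=0`, each video token sees every video token but only the audio token of its own frame. Setting `window` to at least `num_frames - 1` removes the restriction and gives full joint attention over both modalities.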

Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Hearing Loss and Rehabilitation