Improving Joint Audio-Video Generation with Cross-Modal Context Learning

Bingqi Ma, Linlong Lang, Ming Zhang, Dailan He, Xingtong Ge, Yi Zhang, Guanglu Song, and Yu Liu

arXiv:2603.18600 · cs.CV · March 20, 2026

Open Access

TL;DR

This paper introduces Cross-Modal Context Learning (CCL), an approach that addresses limitations of existing dual-stream transformer models for joint audio-video generation, producing higher-quality, better-synchronized content with fewer resources.

Contribution

The paper proposes CCL with modules like TARP, LCT, DCR, and UCG to improve temporal alignment, stability, and training-inference consistency in audio-video generation models.

Findings

01. Achieves state-of-the-art performance on audio-video generation tasks.

02. Requires fewer resources than recent methods.

03. Improves temporal alignment and cross-modal consistency.

Abstract

Joint audio-video generation built on a dual-stream transformer architecture has become the dominant paradigm in current research. By combining pre-trained video and audio diffusion models with a cross-modal interaction attention module, such methods can generate high-quality, temporally synchronized audio-video content with minimal training data. In this paper, we first revisit the dual-stream transformer paradigm and analyze its limitations: model manifold variations caused by the gating mechanism that controls cross-modal interactions, biases in multi-modal background regions introduced by cross-modal attention, inconsistencies in multi-modal classifier-free guidance (CFG) between training and inference, and conflicts between multiple conditions. To alleviate these issues, we propose Cross-Modal Context Learning (CCL), equipped…
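
For readers unfamiliar with the dual-stream setup the abstract refers to, the following is a minimal sketch of a gated cross-modal attention update, in which video tokens attend to audio tokens and a scalar gate scales the residual. It is not the paper's implementation: the function names, the tanh gate, the single attention head, and the absence of learned projections are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head scaled dot-product attention.

    Simplification (assumption): no learned query/key/value projections;
    the context tokens serve as both keys and values.
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # (n_queries, n_context)
    return softmax(scores, axis=-1) @ context   # (n_queries, d)


def gated_cross_modal_update(video_tokens, audio_tokens, gate):
    """Hypothetical gated residual: video + tanh(gate) * attn(video -> audio)."""
    return video_tokens + np.tanh(gate) * cross_attention(video_tokens, audio_tokens)


rng = np.random.default_rng(0)
video = rng.normal(size=(6, 8))    # 6 video tokens, feature dim 8 (made-up sizes)
audio = rng.normal(size=(10, 8))   # 10 audio tokens, same feature dim

closed = gated_cross_modal_update(video, audio, gate=0.0)  # gate shut
opened = gated_cross_modal_update(video, audio, gate=1.0)  # gate partly open
```

With the gate at zero the video stream passes through unchanged, so a pre-trained single-modality model would keep its original behaviour under this sketch. Once the gate opens, the token representations shift away from what the pre-trained model saw, which may be the kind of effect the abstract describes as variation caused by the gating mechanism.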

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Music and Audio Processing