Improving Joint Audio-Video Generation with Cross-Modal Context Learning
Bingqi Ma, Linlong Lang, Ming Zhang, Dailan He, Xingtong Ge, Yi Zhang, Guanglu Song, and Yu Liu

TL;DR
This paper introduces Cross-Modal Context Learning (CCL), an approach to joint audio-video generation that addresses limitations of existing dual-stream transformer models, yielding higher-quality, better-synchronized content with fewer resources.
Contribution
The paper proposes CCL, whose modules (including TARP, LCT, DCR, and UCG) improve temporal alignment, stability, and training-inference consistency in audio-video generation models.
Findings
Achieves state-of-the-art performance on audio-video generation tasks.
Requires fewer resources than recent methods.
Improves temporal alignment and cross-modal consistency.
Abstract
Joint audio-video generation built on the dual-stream transformer architecture has become the dominant paradigm in current research. By combining pre-trained video and audio diffusion models with a cross-modal interaction attention module, such methods can generate high-quality, temporally synchronized audio-video content with minimal training data. In this paper, we first revisit the dual-stream transformer paradigm and analyze its limitations: model manifold variations caused by the gating mechanism that controls cross-modal interactions, biases in multi-modal background regions introduced by cross-modal attention, inconsistencies in multi-modal classifier-free guidance (CFG) between training and inference, and conflicts between multiple conditions. To alleviate these issues, we propose Cross-Modal Context Learning (CCL), equipped…
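For readers unfamiliar with classifier-free guidance, the training/inference pairing the abstract refers to can be sketched roughly as follows. This is the generic single-condition CFG recipe, not the paper's own guidance scheme; the function names `drop_condition` and `cfg_combine`, the null embedding, and the array shapes are all illustrative assumptions.

```python
import numpy as np

def drop_condition(cond, null_cond, drop_prob, rng):
    """Training side (generic CFG, assumed setup): with probability
    drop_prob, replace a sample's condition embedding with the null
    embedding so the model also learns an unconditional prediction."""
    mask = rng.random(cond.shape[0]) < drop_prob  # one decision per sample
    out = cond.copy()
    out[mask] = null_cond  # broadcast the (D,) null embedding over dropped rows
    return out, mask

def cfg_combine(pred_uncond, pred_cond, scale):
    """Inference side: extrapolate from the unconditional prediction
    toward the conditional one by the guidance scale."""
    return pred_uncond + scale * (pred_cond - pred_uncond)

# Hypothetical usage: a batch of 4 condition embeddings of width 8.
rng = np.random.default_rng(0)
cond = rng.standard_normal((4, 8))
null_cond = np.zeros(8)
train_cond, dropped = drop_condition(cond, null_cond, drop_prob=0.1, rng=rng)

pred_uncond = rng.standard_normal((4, 8))
pred_cond = rng.standard_normal((4, 8))
guided = cfg_combine(pred_uncond, pred_cond, scale=5.0)
```

A scale of 1 recovers the conditional prediction and a scale of 0 the unconditional one. A mismatch between how conditions are dropped in training and how the two predictions are formed at inference is the kind of inconsistency the abstract mentions; with several conditions (for example text plus cross-modal features) there is more than one way to drop and combine them.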
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Music and Audio Processing
