Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

Teng Hu; Zhentao Yu; Guozhen Zhang; Zihan Su; Zhengguang Zhou; Youliang Zhang; Yuan Zhou; Qinglin Lu; Ran Yi

arXiv:2511.21579 · cs.CV · December 1, 2025

TL;DR

Harmony is a framework for synchronized audio-visual generation. It targets three challenges in the joint diffusion process: correspondence drift between concurrently evolving noisy latents, global attention that misses fine-grained temporal cues, and the intra-modal bias of conventional Classifier-Free Guidance.

Contribution

The paper presents a novel framework with cross-task training, a global-local alignment module, and a synchronization-enhanced guidance method to improve audio-visual synchronization in generative models.

Findings

01 Achieves state-of-the-art synchronization accuracy.

02 Outperforms existing methods in generation fidelity.

03 Effectively mitigates correspondence drift and improves temporal alignment.

Abstract

The synthesis of synchronized audio-visual content is a key challenge in generative AI, and open-source models still struggle to achieve robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and…
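For readers unfamiliar with the conventional CFG that the abstract refers to, the following is a minimal sketch of the standard guidance rule, which extrapolates from an unconditional noise prediction toward a conditional one. It is not Harmony's synchronization-enhanced guidance, which this excerpt does not specify. The function name `cfg_combine` and the toy values are hypothetical, chosen only for illustration.

```python
# Conventional classifier-free guidance (CFG), sketched on plain Python lists.
# Assumption: eps_uncond and eps_cond stand in for a denoiser's noise
# predictions without and with the conditioning signal (e.g. a text prompt).
# This is the standard rule, not the guidance method proposed in the paper.

def cfg_combine(eps_uncond, eps_cond, scale):
    """Return eps_uncond + scale * (eps_cond - eps_uncond), elementwise."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions (hypothetical values).
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

# scale = 0 recovers the unconditional prediction, scale = 1 the conditional
# one, and scale > 1 pushes further in the direction of the condition.
guided = cfg_combine(eps_uncond, eps_cond, scale=2.0)
print(guided)
```

Because the difference term is taken with respect to the conditioning of a single prediction, the rule strengthens adherence to that condition. It contains no term that directly measures agreement between the audio and video streams, which is consistent with the limitation the abstract describes.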

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing