Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation

Kang Zhang; Trung X. Pham; Suyeon Lee; Axi Niu; Arda Senocak; Joon Son Chung

arXiv:2510.24103 · cs.SD · October 29, 2025

TL;DR

MGAudio is a flow-based framework for open-domain video-to-audio generation built around model-guided dual-role alignment. It reaches an FAD of 0.40 on VGGSound, ahead of the best classifier-free guidance baselines.

Contribution

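The paper proposes a model-guided dual-role alignment mechanism within a flow-based Transformer framework for video-to-audio generation. The audio-visual encoder serves both as a conditioning module and as a feature aligner, and a model-guided training objective replaces classifier-based and classifier-free guidance.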

Findings

01. Achieves state-of-the-art FAD of 0.40 on VGGSound

02. Outperforms classifier-free guidance baselines in quality metrics

03. Generalizes effectively to the UnAV-100 benchmark

Abstract

We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video-conditioned audio generation. The framework integrates three main components: (1) a scalable flow-based Transformer model, (2) a dual-role alignment mechanism where the audio-visual encoder serves both as a conditioning module and as a feature aligner to improve generation quality, and (3) a model-guided objective that enhances cross-modal coherence and audio realism. MGAudio achieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and…
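
For context, the classifier-free guidance that the abstract contrasts against is commonly applied at sampling time by extrapolating from an unconditional prediction toward a conditional one, which costs two forward passes per step. The sketch below illustrates only that baseline technique and is not MGAudio's model-guided objective, which the abstract does not specify in detail. The names `cfg_velocity` and `toy_model` are hypothetical, and the toy velocity field is an assumption chosen purely so the example runs.

```python
from typing import Callable, List, Optional, Sequence

# A velocity model takes (state x, time t, optional condition) and returns a velocity.
VelocityModel = Callable[[Sequence[float], float, Optional[Sequence[float]]], List[float]]


def cfg_velocity(
    model: VelocityModel,
    x: Sequence[float],
    t: float,
    cond: Sequence[float],
    scale: float,
) -> List[float]:
    """Classifier-free guidance: v_uncond + scale * (v_cond - v_uncond)."""
    v_cond = model(x, t, cond)    # forward pass with the condition (e.g. video features)
    v_uncond = model(x, t, None)  # forward pass with the condition dropped
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def toy_model(x: Sequence[float], t: float, cond: Optional[Sequence[float]]) -> List[float]:
    """Hypothetical stand-in for a trained network: pulls x toward cond (or toward 0)."""
    target = cond if cond is not None else [0.0] * len(x)
    return [c - xi for c, xi in zip(target, x)]


if __name__ == "__main__":
    x = [1.0, 2.0]
    cond = [0.5, 0.5]
    print(cfg_velocity(toy_model, x, 0.3, cond, scale=2.0))
```

With `scale=1.0` the result equals the conditional prediction, and with `scale=0.0` it equals the unconditional one; larger scales push further along the conditional direction. A training-time guidance objective, as the abstract describes for MGAudio, aims to avoid this sampling-time combination of two predictions.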
