MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

Ho Kei Cheng; Masato Ishii; Akio Hayakawa; Takashi Shibuya; Alexander Schwing; Yuki Mitsufuji

arXiv:2412.15322 · cs.CV · April 9, 2025

TL;DR

MMAudio is a multimodal joint training framework that synthesizes high-quality, synchronized audio from video and optional text. It outperforms existing public video-to-audio models in audio quality, semantic alignment, and audio-visual synchronization while keeping inference fast and the model small.

Contribution

The paper presents MMAudio, a novel joint training approach leveraging large-scale text-audio data to enhance video-to-audio synthesis and synchronization.

Findings

01. Achieves state-of-the-art video-to-audio results among public models in audio quality, semantic alignment, and audio-visual synchronization.

02. Maintains competitive text-to-audio performance.

03. Runs efficiently: 1.23 s to generate an 8 s clip, with 157M parameters.

Abstract

We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in…
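The abstract states that MMAudio is trained with a flow matching objective but does not give the formulation. As a rough illustration only, the sketch below shows one common variant of conditional flow matching: a straight-line path between a noise sample and a data sample, with the network regressing the constant velocity along that path. The straight-line path, the Euler sampler, and all names (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`, `predict_velocity`) are assumptions made for this example, not the authors' code.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path, constant in t: d/dt x_t = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x1, condition, rng):
    """Single-sample mean squared error between predicted and target velocity.

    predict_velocity is a hypothetical stand-in for the network; it takes
    (x_t, t, condition) and returns a velocity with the same length as x_t.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                         # time drawn uniformly from [0, 1)
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t, condition)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)

def euler_sample(predict_velocity, x0, condition, steps=25):
    """Integrate dx/dt = v(x, t, condition) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt, condition)
        x = [a + dt * b for a, b in zip(x, v)]
    return x
```

In a real system `x1` would be an audio latent and `condition` would carry the video and text features; here both are plain Python lists so the example stays self-contained.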

Code & Models

Repositories

hkchengrex/MMAudio
PyTorch · Official

Datasets

kaiw7/V2A-Sonic
dataset · 135 downloads

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis