MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning
Chunyu Qiang, Jun Wang, Xiaopeng Wang, Kang Yin, Yuxin Guo

TL;DR
MM-Sonate is a multimodal framework for synchronized audio-video generation with zero-shot voice cloning. It aims for high fidelity and precise control through a unified instruction-phoneme input and a noise conditioning scheme.
Contribution
The paper introduces MM-Sonate, a unified flow-based model that combines controllable audio-video synthesis with zero-shot voice cloning, addressing temporal misalignment and fidelity issues of prior methods.
Findings
Sets a new state of the art on joint audio-video generation benchmarks.
Improves lip synchronization and speech intelligibility.
Demonstrates voice cloning quality comparable to specialized TTS systems.
Abstract
Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded generation or lack the capability to perform zero-shot voice cloning within a joint synthesis framework. In this work, we present MM-Sonate, a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. Unlike prior works that rely on coarse semantic descriptions, MM-Sonate utilizes a unified instruction-phoneme input to enforce strict linguistic and temporal alignment. To enable zero-shot voice cloning, we introduce a timbre injection mechanism that effectively decouples speaker identity from linguistic content. Furthermore, addressing the…
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Music Technology and Sound Studies
