MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning

Chunyu Qiang; Jun Wang; Xiaopeng Wang; Kang Yin; Yuxin Guo

arXiv:2601.01568 · cs.SD · January 9, 2026

Open Access

TL;DR

MM-Sonate is a multimodal framework for synchronized audio-video generation with zero-shot voice cloning. It targets high fidelity and precise control through a unified instruction-phoneme input and noise conditioning.

Contribution

The paper introduces MM-Sonate, a unified flow-based model that combines controllable audio-video synthesis with zero-shot voice cloning, addressing the temporal misalignment and fidelity issues of prior methods.

Findings

01. Sets a new state of the art on joint audio-video generation benchmarks.

02. Improves lip synchronization and speech intelligibility.

03. Demonstrates voice cloning quality comparable to specialized TTS systems.

Abstract

Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded generation or lack the capability to perform zero-shot voice cloning within a joint synthesis framework. In this work, we present MM-Sonate, a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. Unlike prior works that rely on coarse semantic descriptions, MM-Sonate utilizes a unified instruction-phoneme input to enforce strict linguistic and temporal alignment. To enable zero-shot voice cloning, we introduce a timbre injection mechanism that effectively decouples speaker identity from linguistic content. Furthermore, addressing the…

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Music Technology and Sound Studies