CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching

Leying Zhang; Yao Qian; Xiaofei Wang; Manthan Thakker; Dongmei Wang; Jianwei Yu; Haibin Wu; Yuxuan Hu; Jinyu Li; Yanmin Qian; Sheng Zhao

arXiv:2506.00885 · cs.SD · October 21, 2025



TL;DR

CoVoMix2 introduces a fully non-autoregressive flow-matching model for zero-shot multi-talker dialogue generation, achieving high speech quality, speaker consistency, and fast inference without relying on intermediate token representations.

Contribution

The paper presents a novel non-autoregressive framework that predicts mel-spectrograms directly from transcriptions, using speaker disentanglement and masking strategies to improve dialogue synthesis.

Findings

01 Outperforms MoonCast and Sesame in quality and speed

02 Operates without transcriptions for prompts

03 Supports overlapping speech and timing control

Abstract

Generating natural-sounding, multi-speaker dialogue is crucial for applications such as podcast creation, virtual agents, and multimedia content generation. However, existing systems struggle to maintain speaker consistency, model overlapping speech, and synthesize coherent conversations efficiently. In this paper, we introduce CoVoMix2, a fully non-autoregressive framework for zero-shot multi-talker dialogue generation. CoVoMix2 directly predicts mel-spectrograms from multi-stream transcriptions using a flow-matching-based generative model, eliminating the reliance on intermediate token representations. To better capture realistic conversational dynamics, we propose transcription-level speaker disentanglement, sentence-level alignment, and prompt-level random masking strategies. Our approach achieves state-of-the-art performance, outperforming strong baselines like MoonCast and Sesame…
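As a rough illustration of what "flow-matching-based" and "fully non-autoregressive" mean in practice, the sketch below shows generic conditional flow matching on a mel-spectrogram-shaped array. It is not the CoVoMix2 implementation: the linear noise-to-data path, the 80-bin by 50-frame shape, the function names, and the closed-form `oracle_velocity` (which stands in for a trained network conditioned on transcriptions and speaker prompts) are all assumptions made for this example.

```python
import numpy as np


def cfm_training_pair(x1, rng):
    """Build one flow-matching training example for a target spectrogram x1.

    Assumes a linear path x_t = (1 - t) * x0 + t * x1 from noise x0 to data x1,
    whose velocity along the path is the constant x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)   # noise sample at t = 0
    t = rng.uniform()                    # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1        # point on the path at time t
    target_velocity = x1 - x0            # regression target for the network
    return x_t, t, target_velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))


def euler_sample(velocity_fn, shape, num_steps, rng):
    """Generate all frames at once by integrating dx/dt = v(x, t) from t=0 to t=1.

    Every frame is updated in parallel at each step, which is what makes the
    procedure non-autoregressive along the time axis of the spectrogram.
    """
    x = rng.standard_normal(shape)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x


rng = np.random.default_rng(0)

# Hypothetical target: 80 mel bins by 50 frames of made-up data.
target_mel = rng.standard_normal((80, 50))


def oracle_velocity(x, t):
    """Stand-in for a trained model: the exact velocity that transports any
    point x at time t to target_mel along a straight line."""
    return (target_mel - x) / (1.0 - t)


x_t, t, v = cfm_training_pair(target_mel, rng)
generated = euler_sample(oracle_velocity, target_mel.shape, num_steps=8, rng=rng)
```

With the oracle velocity, the Euler integration lands on `target_mel` after the final step. A real system would replace `oracle_velocity` with a learned network and would typically pass the resulting mel-spectrogram to a vocoder to obtain a waveform.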


Videos

CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching · SlidesLive

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Speech and dialogue systems