ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching

Han Zhu; Wei Kang; Liyong Guo; Zengwei Yao; Fangjun Kuang; Weiji Zhuang; Zhaoqing Li; Zhifeng Han; Dong Zhang; Xin Zhang; Xingchen Song; Lingxuan Ye; Long Lin; Daniel Povey

arXiv:2507.09318 · eess.AS · April 15, 2026

TL;DR

ZipVoice-Dialog introduces a non-autoregressive flow-matching model for spoken dialogue generation, enhancing speed, stability, and speaker turn accuracy, supported by a new large-scale dataset and evaluation benchmark.

Contribution

The paper presents a novel non-autoregressive flow-matching approach for dialogue generation, along with curriculum learning, speaker-turn embeddings, and a large open dataset for training and evaluation.

Findings

1. ZipVoice-Dialog achieves faster inference and higher speech intelligibility.
2. The model demonstrates improved speaker turn-taking accuracy.
3. The OpenDialog dataset enables comprehensive benchmarking of dialogue models.

Abstract

Generating spoken dialogue is inherently more complex than monologue text-to-speech (TTS), as it demands both realistic turn-taking and the maintenance of distinct speaker timbres. While existing autoregressive (AR) models have made progress, they often suffer from high inference latency and stability issues. To overcome these limitations, we propose ZipVoice-Dialog, a non-autoregressive (NAR) zero-shot spoken dialogue generation model based on flow-matching. Observing that applying vanilla flow-matching to dialogue generation leads to poor speech intelligibility and turn-taking precision, we introduce two simple yet effective methods to adapt flow-matching architectures for dialogue generation: (1) a curriculum learning strategy to ensure robust speech-text alignment, and (2) speaker-turn embeddings to govern precise speaker turn-taking. Additionally, we introduce dedicated strategies…
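The abstract names speaker-turn embeddings as the mechanism that governs turn-taking but does not spell out how they are applied. The sketch below shows one plausible reading, not the authors' implementation: each text token gets its usual embedding plus a per-speaker vector selected by whoever is speaking at that point. The `[S1]`/`[S2]` tag format, the two-speaker limit, the whitespace tokenizer, the additive combination, and every function name here are assumptions made for illustration; in a real model both tables would be learned parameters rather than random values.

```python
import random


def make_embedding_table(num_rows, dim, seed):
    """Hypothetical stand-in for a learned embedding table (random values)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def parse_dialogue(script):
    """Split a tagged script into tokens and a parallel list of speaker ids.

    Assumes turns are marked with [S1] / [S2]; this tag format is illustrative.
    """
    tokens, speakers = [], []
    current = None
    for word in script.split():
        if word in ("[S1]", "[S2]"):
            current = 0 if word == "[S1]" else 1
            continue
        if current is None:
            raise ValueError("text appears before the first speaker tag")
        tokens.append(word)
        speakers.append(current)
    return tokens, speakers


def embed_with_speaker_turns(tokens, speakers, vocab, text_table, turn_table):
    """Add the active speaker's turn embedding to each token embedding."""
    features = []
    for token, speaker in zip(tokens, speakers):
        text_vec = text_table[vocab[token]]
        turn_vec = turn_table[speaker]
        features.append([t + s for t, s in zip(text_vec, turn_vec)])
    return features


# Usage: two turns, three tokens, 4-dimensional embeddings.
tokens, speakers = parse_dialogue("[S1] hello there [S2] hi")
vocab = {word: index for index, word in enumerate(sorted(set(tokens)))}
text_table = make_embedding_table(len(vocab), dim=4, seed=0)
turn_table = make_embedding_table(2, dim=4, seed=1)  # one row per speaker
features = embed_with_speaker_turns(tokens, speakers, vocab, text_table, turn_table)
```

With this layout, the same word carries a different conditioning vector depending on who says it, which gives the downstream generator an explicit signal at every position about which speaker's voice should be active.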

Code & Models

Repositories

k2-fsa/ZipVoice (GitHub)
