ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching

Han Zhu; Wei Kang; Zengwei Yao; Liyong Guo; Fangjun Kuang; Zhaoqing Li; Weiji Zhuang; Long Lin; Daniel Povey

arXiv:2506.13053 · eess.AS · August 8, 2025

TL;DR

ZipVoice is a flow-matching-based zero-shot text-to-speech model that matches state-of-the-art speech quality with a compact model size and fast inference, using Zipformer-based components and flow distillation.

Contribution

The paper introduces ZipVoice, a compact and fast flow-matching-based zero-shot TTS model. Its key components are a Zipformer-based vector field estimator, a Zipformer-based text encoder with average upsampling-based initial speech-text alignment, and a flow distillation method.
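To make the cost argument behind flow matching and flow distillation concrete, the following is a minimal sketch of a generic fixed-step Euler sampler with optional classifier-free guidance. It is not the ZipVoice implementation: `euler_sample`, `toy_field`, the `(x, t, use_condition)` signature, and the guidance formula `(1 + w) * v_cond - w * v_uncond` are all assumptions chosen for illustration, and the toy field stands in for a learned vector field estimator. The point it shows is that the number of network evaluations grows with the number of sampling steps and doubles when guidance needs a second, unconditional pass.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]
# Hypothetical estimator signature: (x, t, use_condition) -> velocity.
VectorField = Callable[[Vector, float, bool], Vector]


def euler_sample(
    field: VectorField,
    x0: Vector,
    num_steps: int,
    guidance_scale: Optional[float] = None,
) -> Tuple[Vector, int]:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler.

    Returns the final sample and the number of field evaluations (NFE).
    """
    x = list(x0)
    dt = 1.0 / num_steps
    nfe = 0
    for k in range(num_steps):
        t = k * dt
        v = field(x, t, True)  # conditional pass
        nfe += 1
        if guidance_scale is not None:
            # Guidance needs a second, unconditional pass per step.
            v_uncond = field(x, t, False)
            nfe += 1
            w = guidance_scale
            # One common guidance form (an assumption, not taken from the paper).
            v = [(1.0 + w) * c - w * u for c, u in zip(v, v_uncond)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, nfe


def toy_field(target: Vector) -> VectorField:
    """Toy stand-in for a learned estimator: drifts straight toward `target`."""
    def field(x: Vector, t: float, use_condition: bool) -> Vector:
        if not use_condition:
            return [0.0 for _ in x]  # toy unconditional field: no drift
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return field


if __name__ == "__main__":
    field = toy_field([1.0, -2.0])
    sample, nfe = euler_sample(field, [0.0, 0.0], num_steps=4)
    print(sample, nfe)
    _, nfe_guided = euler_sample(field, [0.0, 0.0], num_steps=16, guidance_scale=2.0)
    print(nfe_guided)
```

In this sketch, a 16-step guided run costs 32 field evaluations, while a 4-step unguided run costs 4. According to the abstract, flow distillation targets both factors: fewer sampling steps and no separate guidance pass.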

Findings

01 Matches state-of-the-art models in speech quality

02 Three times smaller than a DiT-based comparison model

03 Up to 30 times faster inference than the same comparison model

Abstract

Existing large-scale zero-shot text-to-speech (TTS) models deliver high speech quality but suffer from slow inference because of their massive parameter counts. To address this issue, this paper introduces ZipVoice, a high-quality flow-matching-based zero-shot TTS model with a compact model size and fast inference speed. Key designs include: 1) a Zipformer-based vector field estimator to maintain adequate modeling capability under a constrained model size; 2) average upsampling-based initial speech-text alignment and a Zipformer-based text encoder to improve speech intelligibility; 3) a flow distillation method to reduce sampling steps and eliminate the inference overhead associated with classifier-free guidance. Experiments on 100k hours of multilingual data show that ZipVoice matches state-of-the-art models in speech quality, while being 3 times smaller and up to 30 times faster than a DiT-based…
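The "average upsampling-based initial speech-text alignment" in design 2 can be pictured with a small sketch. It assumes the simplest reading of the term: every text token gets roughly the same duration, so the token sequence is stretched evenly to the number of speech frames. The function name `average_upsample` and the length check are hypothetical and are not taken from the ZipVoice code.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def average_upsample(tokens: Sequence[T], num_frames: int) -> List[T]:
    """Stretch `tokens` to `num_frames` entries with near-equal token durations.

    Frame i is assigned to token floor(i * len(tokens) / num_frames), so each
    token covers either floor(F / N) or ceil(F / N) consecutive frames.
    """
    n = len(tokens)
    if n == 0:
        raise ValueError("tokens must be non-empty")
    if num_frames < n:
        # Assumption for this sketch: at least one frame per token.
        raise ValueError("num_frames must be >= number of tokens")
    return [tokens[i * n // num_frames] for i in range(num_frames)]


if __name__ == "__main__":
    print(average_upsample(["h", "i", "!"], 6))
    print(average_upsample(["h", "i", "!"], 7))
```

An alignment of this kind is only a starting point: it gives every frame a text token to attend to without a separate duration predictor, and the model is left to refine the timing.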

Code & Models

Repositories

k2-fsa/ZipVoice (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling