MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

Bowen Zhang; Congchao Guo; Geng Yang; Hang Yu; Haozhe Zhang; Heidi Lei; Jialong Mai; Junjie Yan; Kaiyue Yang; Mingqi Yang; Peikai Huang; Ruiyang Jin; Sitan Jiang; Weihua Cheng; Yawei Li; Yichen Xiao; Yiying Zhou; Yongmao Zhang; Yuan Lu; Yucen He

arXiv:2505.07916 · eess.AS · May 14, 2025

TL;DR

MiniMax-Speech is an autoregressive Transformer-based TTS model whose learnable speaker encoder enables high-quality zero-shot voice cloning and expressive speech synthesis across 32 languages. It achieves state-of-the-art results on objective voice cloning metrics (Word Error Rate and Speaker Similarity).

Contribution

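The paper introduces two components for a multilingual TTS system: a learnable speaker encoder, which extracts timbre features from reference audio without requiring a transcription and so enables zero-shot voice cloning, and Flow-VAE, which improves the overall quality of the synthesized audio. The model also extends to applications such as emotion control and professional voice cloning.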

Findings

01 Achieves state-of-the-art results on objective voice cloning metrics (Word Error Rate and Speaker Similarity).

02 Supports high-quality synthesis in 32 languages.

03 Extends to applications such as emotion control and professional voice cloning.

Abstract

We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high similarity to the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning metrics (Word Error Rate and Speaker Similarity) and has secured the top position…
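The abstract names Word Error Rate and Speaker Similarity as its objective voice cloning metrics. As a rough illustration of what these two quantities measure, the Python sketch below assumes the common definitions: WER as word-level edit distance divided by the reference length, and speaker similarity as cosine similarity between two speaker embedding vectors. The function names and toy inputs are hypothetical, and the paper's actual evaluation pipeline (for example, which ASR and speaker verification models it uses) is not described in this excerpt.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words; curr is the row for i words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy transcript pair: one substitution and one deletion over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Toy embeddings standing in for reference and synthesized speaker vectors.
    print(cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.85, 0.05]))
```

In practice the hypothesis transcript would come from running a speech recognizer on the synthesized audio, and the embeddings from a speaker verification model applied to the reference and synthesized clips. Lower WER indicates more intelligible speech; higher cosine similarity indicates a closer timbre match.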
