OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
Han Zhu, Lingxuan Ye, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhifeng Han, Weiji Zhuang, Long Lin, Daniel Povey

TL;DR
OmniVoice is a novel zero-shot multilingual TTS model using a diffusion language model architecture, capable of synthesizing speech in over 600 languages with state-of-the-art quality.
Contribution
It maps text directly to multi-codebook acoustic tokens, avoiding the two-stage text-to-semantic-to-acoustic pipeline. Two training choices make this work: full-codebook random masking and initialization from a pre-trained LLM.
Findings
Achieves state-of-the-art results on Chinese, English, and multilingual benchmarks.
Scales zero-shot speech synthesis to over 600 languages.
Trains on a 581k-hour multilingual dataset curated entirely from open-source data.
Abstract
We present OmniVoice, a massively multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. At its core is a novel diffusion language model-style discrete non-autoregressive (NAR) architecture. Unlike conventional discrete NAR models that suffer from performance bottlenecks in complex two-stage (text-to-semantic-to-acoustic) pipelines, OmniVoice directly maps text to multi-codebook acoustic tokens. This simplified approach is facilitated by two key technical innovations: (1) a full-codebook random masking strategy for efficient training, and (2) initialization from a pre-trained LLM to ensure superior intelligibility. By leveraging a 581k-hour multilingual dataset curated entirely from open-source data, OmniVoice achieves the broadest language coverage to date and delivers state-of-the-art performance across Chinese, English, and diverse multilingual…
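The abstract names a full-codebook random masking strategy but does not spell it out here. The following is a minimal sketch of one plausible reading, not the authors' implementation: every (frame, codebook) position in the acoustic token grid is masked independently with a sampled ratio, so a single training step covers all codebooks at once. The names `MASK_ID` and `random_mask_full_grid`, the codebook size of 1024, and the 8-codebook layout are illustrative assumptions.

```python
import random

MASK_ID = -1  # hypothetical placeholder id for a masked token


def random_mask_full_grid(tokens, mask_ratio, rng):
    """Mask positions across ALL codebooks of a [frames][codebooks] token grid.

    Assumption: each position is masked independently with probability
    `mask_ratio`; the paper may use a different sampling scheme.
    Returns (masked_tokens, mask_flags), both shaped like `tokens`.
    """
    masked, flags = [], []
    for frame in tokens:
        frame_flags = [rng.random() < mask_ratio for _ in frame]
        masked.append([MASK_ID if f else t for t, f in zip(frame, frame_flags)])
        flags.append(frame_flags)
    return masked, flags


# Toy example: 50 frames, 8 codebooks, ids drawn from a 1024-entry codebook.
rng = random.Random(0)
tokens = [[rng.randrange(1024) for _ in range(8)] for _ in range(50)]
ratio = rng.random()  # a fresh masking ratio would be drawn per training sample
masked, flags = random_mask_full_grid(tokens, ratio, rng)
```

In a masked-token training setup of this kind, the model would be asked to predict the original ids only at positions where `flags` is true, conditioned on the text and the unmasked tokens.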
Code & Models
- 🤗 kjanh/KhanhTTS-OmniVoice · model · 4.3k dl · ♡ 30
- 🤗 k2-fsa/OmniVoice · model · 2.3M dl · ♡ 916
- 🤗 splendor1811/omnivoice-vietnamese · model · 2.2k dl · ♡ 16
- 🤗 Prince-1/OmniVoice-Onnx · model · ♡ 1
- 🤗 k2-fsa/TTS_eval_models · model · ♡ 3
- 🤗 edwixx/OmniVoice · model · 721 dl · ♡ 1
- 🤗 drbaph/OmniVoice-bf16 · model · 2.6k dl · ♡ 20
- 🤗 Bgeorge/OmniVoice · model · 5 dl · ♡ 2
- 🤗 WANJIAX2197/OmniVoice · model · 4 dl
- 🤗 Manish993135/OmniVoice · model · 3 dl
