Voxtral TTS
Mistral-AI: Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg, Andy Lo, Chen-Yo Sun, Guillaume Lample, Henry Lagarde, Jean-Malo Delignon, Jaeyoung Kim, John Harvill, Khyathi Raghavi Chandu, Lorenzo Signoretti, Margaret Jennings, Patrick von Platen, Pavankumar Reddy Muddireddy

TL;DR
Voxtral TTS is a multilingual text-to-speech model that produces natural, expressive speech from as little as 3 seconds of reference audio, using a hybrid auto-regressive and flow-matching architecture with a custom speech tokenizer, Voxtral Codec.
Contribution
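It introduces a hybrid architecture that generates semantic speech tokens auto-regressively and acoustic tokens with flow matching, paired with Voxtral Codec, a new speech tokenizer trained from scratch. Together these enable high-quality multilingual voice cloning from as little as 3 seconds of reference audio.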
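To make the flow-matching half of the hybrid architecture more concrete, here is a minimal toy sketch of flow-matching sampling with Euler steps. It is not the paper's implementation: the `velocity` function is a hypothetical closed-form stand-in for a learned network, and the `target` vector stands in for acoustic features. In a real model the velocity would be predicted by a network conditioned on context, and the target would not be known in advance.

```python
import random


def velocity(x, t, target):
    # Hypothetical stand-in for a learned velocity network.
    # For a straight-line path x_t = (1 - t) * x_0 + t * x_1 toward a single
    # known target x_1, the velocity at (x, t) is (x_1 - x) / (1 - t).
    return [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]


def sample(target, n_steps=8, seed=0):
    # Start from Gaussian noise and integrate dx/dt = velocity(x, t)
    # from t = 0 to t = 1 with fixed-size Euler steps.
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # t stays below 1, so the division above is safe
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    print(sample([0.5, -1.0, 2.0]))
```

Because this toy field always points straight at the target, a handful of Euler steps is enough to land on it. With a learned velocity field, the number of steps trades off quality against latency.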
Findings
Voxtral TTS outperforms ElevenLabs Flash v2.5 in human evaluations for naturalness and expressivity.
It achieves a 68.4% win rate over ElevenLabs Flash v2.5 in human preference tests on multilingual voice cloning, judged by native speakers.
It can generate expressive speech from as little as 3 seconds of reference audio.
Abstract
We introduce Voxtral TTS, an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. Voxtral TTS adopts a hybrid architecture that combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. These tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme. In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5. We release the model weights under a CC BY-NC license.
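For readers unfamiliar with the FSQ part of "VQ-FSQ", the following is a minimal sketch of generic finite scalar quantization: each latent dimension is squashed with tanh and rounded to a small fixed set of integer levels, and the per-dimension codes are packed into one token index. The function names, the level counts, and the restriction to odd level counts are assumptions made for illustration. This does not reflect Voxtral Codec's actual configuration, which the abstract does not specify.

```python
import math


def fsq_quantize(z, levels):
    # Quantize each dimension of z to an integer code in 0..levels[i]-1.
    # Assumes every entry of `levels` is odd, which keeps the grid symmetric.
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) // 2
        bounded = math.tanh(value) * half  # lies in (-half, half)
        codes.append(round(bounded) + half)  # shift to a non-negative code
    return codes


def codes_to_index(codes, levels):
    # Pack per-dimension codes into a single token id (mixed-radix number).
    index = 0
    for code, n_levels in zip(codes, levels):
        index = index * n_levels + code
    return index


def index_to_codes(index, levels):
    # Inverse of codes_to_index.
    codes = []
    for n_levels in reversed(levels):
        codes.append(index % n_levels)
        index //= n_levels
    return codes[::-1]


if __name__ == "__main__":
    levels = [5, 5, 5]  # illustrative: 5 * 5 * 5 = 125 possible tokens
    codes = fsq_quantize([0.0, 10.0, -10.0], levels)
    print(codes, codes_to_index(codes, levels))
```

Unlike a learned VQ codebook, the FSQ "codebook" here is implicit: it is the fixed grid of all level combinations, so there are no codebook vectors to train or to collapse.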
Code & Models
- 🤗 mistralai/Voxtral-4B-TTS-2603 (model) · 5.7k dl · ♡ 807
- 🤗 servantofares/Voxtral-4B-TTS-2603 (model) · 5 dl
- 🤗 GotTravisOnMyBack/Voxtral-4B-TTS (model) · 3 dl
- 🤗 YadiHuMDev/Voxtral-4B-TTS-2603 (model) · 2 dl
- 🤗 Manish993135/Voxtral-4B-TTS-2603 (model) · 1 dl
- 🤗 Eco9689/VoiceForge (model)
- 🤗 riasatsk/Voxtral-4B-TTS-2603 (model)
- 🤗 xiyef/Voxtral-4B-TTS-2603 (model)
- 🤗 LittleDesignSolution/Voxtral-4B-TTS-2603 (model) · 3 dl
- 🤗 mainline777/Voxtral-4B-TTS-2603 (model) · 12 dl
