TL;DR
VibeVoice is a speech synthesis model that combines next-token diffusion with a new continuous speech tokenizer to generate long-form, multi-speaker speech: up to 90 minutes with up to 4 speakers.
Contribution
It introduces a continuous speech tokenizer that compresses audio 80 times more than Encodec while maintaining comparable performance, which makes long-duration, multi-speaker synthesis computationally practical.
Findings
80x higher data compression than Encodec at comparable performance
Synthesizes up to 90 minutes of long-form speech within a 64K context window
Supports up to 4 speakers in a single generation
Abstract
This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.
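To see why the compression rate matters, the abstract's own numbers can be turned into a rough token budget. The sketch below is back-of-the-envelope arithmetic, not the authors' code: it assumes "64K" means 65,536 tokens and that the whole window is spent on speech tokens. The function name `max_tokens_per_second` is hypothetical.

```python
def max_tokens_per_second(context_tokens: int, minutes: float) -> float:
    """Upper bound on the average token rate if `minutes` of audio
    must fit inside a context window of `context_tokens` tokens."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return context_tokens / (minutes * 60.0)


# Assumption: "64K" in the abstract is read as 64 * 1024 = 65,536 tokens.
CONTEXT_TOKENS = 64 * 1024
# From the abstract: up to 90 minutes of speech.
MAX_MINUTES = 90

budget = max_tokens_per_second(CONTEXT_TOKENS, MAX_MINUTES)
print(f"{budget:.2f} tokens per second of audio")  # prints "12.14 tokens per second of audio"
```

Under these assumptions the model can spend at most about 12 tokens per second of audio. Any script text or speaker prompts that share the window lower that budget further, which is why a tokenizer with a very low frame rate is needed for 90-minute generation.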
Code & Models
- 🤗 microsoft/VibeVoice-1.5B (model) · 63k dl · ♡ 2281
- 🤗 microsoft/VibeVoice-Realtime-0.5B (model) · 364k dl · ♡ 1163
- 🤗 aoi-ot/VibeVoice-Large (model) · 2.8k dl · ♡ 224
- 🤗 vibevoice/VibeVoice-7B (model) · 9.0k dl · ♡ 174
- 🤗 elbruno/VibeVoice-Realtime-0.5B-ONNX (model) · 99 dl · ♡ 3
- 🤗 chaitnya26/VibeVoice-1.5b-fork (model) · 1 dl · ♡ 1
- 🤗 sheliak/VibeVoice-Large_Mirror (model) · 893 dl · ♡ 6
- 🤗 newsletter/VibeVoice-Large-pt (model) · 3 dl · ♡ 2
- 🤗 aoi-ot/VibeVoice-7B (model) · 1.5k dl · ♡ 24
- 🤗 Sergey004/VibeVoice-1.5B (model) · 1 dl
