Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis
Shijia Liao, Yuxuan Wang, Tianyu Li, Yifan Cheng, Ruoyi Zhang, Rongzhi Zhou, Yijin Xing

TL;DR
Fish-Speech introduces a novel multilingual TTS framework using large language models and a dual autoregressive architecture to improve naturalness, efficiency, and multilingual support in speech synthesis.
Contribution
The paper presents Fish-Speech, a new TTS system that leverages LLMs for linguistic features and a dual-AR architecture to enhance stability and quality, streamlining multilingual speech synthesis.
Findings
Outperforms baseline models in complex linguistic scenarios
Achieves near-100% codebook utilization with the Firefly-GAN (FF-GAN) vocoder
Enhances multilingual and voice cloning capabilities
Abstract
Text-to-Speech (TTS) systems face ongoing challenges in processing complex linguistic features, handling polyphonic expressions, and producing natural-sounding multilingual speech - capabilities that are crucial for future AI applications. In this paper, we present Fish-Speech, a novel framework that implements a serial fast-slow Dual Autoregressive (Dual-AR) architecture to enhance the stability of Grouped Finite Scalar Vector Quantization (GFSQ) in sequence generation tasks. This architecture improves codebook processing efficiency while maintaining high-fidelity outputs, making it particularly effective for AI interactions and voice cloning. Fish-Speech leverages Large Language Models (LLMs) for linguistic feature extraction, eliminating the need for traditional grapheme-to-phoneme (G2P) conversion and thereby streamlining the synthesis pipeline and enhancing multilingual support.…
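The abstract names GFSQ without showing how it works. The sketch below illustrates the general idea of grouped finite scalar quantization and of measuring codebook utilization. It is not the authors' implementation: the function names, the group count, and the per-dimension level counts are all hypothetical choices made for illustration. It assumes each latent dimension is bounded with tanh, rounded to a small odd number of integer levels, and that each group of dimensions is packed into one code index.

```python
import math

def fsq_quantize(values, levels):
    """Quantize one group: bound each value with tanh, then round it to
    one of levels[i] integer steps (odd level counts assumed)."""
    codes = []
    for v, n in zip(values, levels):
        half = (n - 1) // 2                    # steps run from -half to +half
        step = int(round(math.tanh(v) * half))
        codes.append(step + half)              # shift into the range 0..n-1
    return codes

def codes_to_index(codes, levels):
    """Pack per-dimension codes into a single mixed-radix codebook index."""
    index = 0
    for c, n in zip(codes, levels):
        index = index * n + c
    return index

def gfsq_encode(vector, num_groups, levels):
    """Split the vector into equal groups and quantize each independently,
    yielding one codebook index per group."""
    size = len(vector) // num_groups
    if size * num_groups != len(vector) or size != len(levels):
        raise ValueError("vector length must equal num_groups * len(levels)")
    return [
        codes_to_index(fsq_quantize(vector[g * size:(g + 1) * size], levels), levels)
        for g in range(num_groups)
    ]

def codebook_utilization(indices, levels):
    """Fraction of the implicit codebook that appears in the indices."""
    return len(set(indices)) / math.prod(levels)

# Hypothetical setup: 2 groups of 3 dimensions, 5 levels per dimension,
# giving an implicit codebook of 5 * 5 * 5 = 125 entries per group.
LEVELS = [5, 5, 5]
frame = [0.0, 0.0, 0.0, 10.0, -10.0, 0.0]
indices = gfsq_encode(frame, num_groups=2, levels=LEVELS)
utilization = codebook_utilization(indices, LEVELS)
```

Because the codebook here is implicit in the rounding grid rather than learned, every index is reachable by construction. Utilization measured over a large corpus then shows how much of that grid the encoder actually uses, which is the quantity the "near 100%" finding refers to.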
Code & Models
- 🤗 fishaudio/fish-speech-1.5 (model) · 7.2k dl · ♡ 721
- 🤗 fishaudio/fish-speech-1.4 (model) · 755 dl · ♡ 457
- 🤗 cocktailpeanut/f15 (model) · 2 dl
- 🤗 jkeisling/fish-speech-1.5 (model) · 48 dl · ♡ 1
- 🤗 ModelsLab/fish-speech-1.5 (model) · 33 dl · ♡ 3
- 🤗 waynecraig/fish-speech-1.5-wuhan (model) · 1 dl
- 🤗 alvarofranz/fish-speech-1.5 (model) · 1 dl
- 🤗 Gidigi/gidigi_a6665d09_0008 (model)
- 🤗 Smithke/fish-speech-1.5 (model)
- 🤗 macarrao12/fish-speech-1.5 (model) · 4 dl
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
