Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis

Shijia Liao; Yuxuan Wang; Tianyu Li; Yifan Cheng; Ruoyi Zhang; Rongzhi Zhou; Yijin Xing

arXiv:2411.01156·cs.SD·November 12, 2024

TL;DR

Fish-Speech introduces a novel multilingual TTS framework using large language models and a dual autoregressive architecture to improve naturalness, efficiency, and multilingual support in speech synthesis.

Contribution

The paper presents Fish-Speech, a TTS system that uses LLMs for linguistic feature extraction in place of G2P conversion and a Dual-AR architecture to stabilize GFSQ-based sequence generation, streamlining multilingual speech synthesis.

Findings

01. Outperforms baseline models in complex linguistic scenarios
02. Achieves near 100% codebook utilization with FF-GAN
03. Enhances multilingual and voice cloning capabilities

Abstract

Text-to-Speech (TTS) systems face ongoing challenges in processing complex linguistic features, handling polyphonic expressions, and producing natural-sounding multilingual speech, capabilities that are crucial for future AI applications. In this paper, we present Fish-Speech, a novel framework that implements a serial fast-slow Dual Autoregressive (Dual-AR) architecture to enhance the stability of Grouped Finite Scalar Vector Quantization (GFSQ) in sequence generation tasks. This architecture improves codebook processing efficiency while maintaining high-fidelity outputs, making it particularly effective for AI interactions and voice cloning. Fish-Speech leverages Large Language Models (LLMs) for linguistic feature extraction, eliminating the need for traditional grapheme-to-phoneme (G2P) conversion and thereby streamlining the synthesis pipeline and enhancing multilingual support.…
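
As a rough illustration of the quantization idea the abstract refers to, the Python sketch below shows a generic grouped finite scalar quantizer together with a codebook-utilization measure. It is not the Fish-Speech implementation: the function names (`fsq_quantize`, `grouped_fsq`, `utilization`), the tanh bounding, the level counts, and the group layout are all assumptions chosen for illustration.

```python
import math


def fsq_quantize(vec, levels):
    """Quantize each scalar in `vec` to one of `levels[i]` integer levels.

    Returns (digits, index), where `index` is the mixed-radix code formed
    from the per-dimension digits. Hypothetical helper, for illustration only.
    """
    digits = []
    for x, n in zip(vec, levels):
        # Bound the value to [0, 1] via tanh, then snap to the nearest level.
        unit = (math.tanh(x) + 1.0) / 2.0
        digits.append(min(n - 1, int(round(unit * (n - 1)))))
    index = 0
    for d, n in zip(digits, levels):
        index = index * n + d
    return digits, index


def grouped_fsq(vec, levels, groups):
    """Split `vec` into `groups` equal chunks and quantize each independently.

    Returns one code index per group (an assumed layout, not the paper's).
    """
    size = len(vec) // groups
    if size * groups != len(vec) or size != len(levels):
        raise ValueError("vec must split into `groups` chunks of len(levels)")
    return [
        fsq_quantize(vec[g * size:(g + 1) * size], levels)[1]
        for g in range(groups)
    ]


def utilization(indices, codebook_size):
    """Fraction of the codebook that appears at least once in `indices`."""
    return len(set(indices)) / codebook_size
```

With these assumptions, `grouped_fsq([0.0, 0.0, 10.0, -10.0], [3, 3], 2)` returns `[4, 6]`, and the implicit codebook for each group has `math.prod([3, 3]) == 9` entries. A utilization close to 1.0 over a large set of codes would correspond to the "near 100% codebook utilization" reported in the findings.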

Code & Models

Repositories

fishaudio/fish-speech
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling