SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation
Dong Zhang, Xin Zhang, Jun Zhan, Shimin Li, Yaqian Zhou, Xipeng Qiu

TL;DR
SpeechGPT-Gen is a scalable speech generation model that separates semantic and perceptual information, improving zero-shot speech generation and the efficiency of large-scale speech modeling.
Contribution
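The paper proposes Chain-of-Information Generation (CoIG) to decouple semantic and perceptual modeling, and develops SpeechGPT-Gen, an 8-billion-parameter model that pairs an autoregressive LLM for semantic modeling with a non-autoregressive flow-matching model for perceptual modeling, with semantic information infused into the flow-matching prior.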
Findings
Performs strongly in zero-shot text-to-speech and zero-shot voice conversion
Demonstrates efficient semantic and perceptual information modeling
Achieves significant improvements in speech dialogue tasks
Abstract
Benefiting from effective speech modeling, current Speech Large Language Models (SLLMs) have demonstrated exceptional capabilities in in-context speech generation and efficient generalization to unseen speakers. However, the prevailing information modeling process is encumbered by certain redundancies, leading to inefficiencies in speech generation. We propose Chain-of-Information Generation (CoIG), a method for decoupling semantic and perceptual information in large-scale speech generation. Building on this, we develop SpeechGPT-Gen, an 8-billion-parameter SLLM efficient in semantic and perceptual information modeling. It comprises an autoregressive model based on LLM for semantic information modeling and a non-autoregressive model employing flow matching for perceptual information modeling. Additionally, we introduce the novel approach of infusing semantic information into the prior…
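The flow-matching stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a linear interpolation path between a prior sample and the target, and it assumes that "infusing semantic information into the prior" means centering a Gaussian prior on the semantic features; the abstract is truncated before it states the exact form. All names (`sample_prior`, `flow_matching_pair`, `euler_generate`) and the toy vectors are hypothetical.

```python
import random

def sample_prior(semantic, sigma, rng):
    """Draw x0 from a Gaussian centered on the semantic features.

    Assumption: this is one possible reading of "semantic infusion";
    a standard prior would be centered on zero instead.
    """
    return [s + sigma * rng.gauss(0.0, 1.0) for s in semantic]

def flow_matching_pair(x0, x1, t):
    """Return the point on the linear path at time t and its target velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]  # constant along a linear path
    return x_t, velocity

def euler_generate(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
semantic = [0.2, -1.0, 0.5, 1.3]    # toy semantic features (hypothetical)
perceptual = [0.3, -0.8, 0.1, 1.0]  # toy perceptual target (hypothetical)

x0 = sample_prior(semantic, 0.5, rng)
x_t, velocity = flow_matching_pair(x0, perceptual, 0.3)

# A trained network would predict the velocity from (x_t, t); here the
# oracle velocity stands in for it to show that integration reaches the target.
generated = euler_generate(lambda x, t: velocity, x0, steps=10)
```

In a real system the oracle velocity would be replaced by a neural network trained to regress `velocity` from `x_t` and `t`. Starting the path from a prior already close to the semantic content shortens the distance the flow must cover, which is one plausible source of the efficiency gain the abstract mentions.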
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
