Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation
Jiaqi Li, Dongmei Wang, Xiaofei Wang, Yao Qian, Long Zhou, Shujie Liu, Midia Yousefi, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Yanqing Liu, Junkun Chen, Sheng Zhao, Jinyu Li, Zhizheng Wu, Michael Zeng

TL;DR
This paper systematically studies how neural audio codec tokens influence speech generation in speech language models, revealing that high-quality codecs are essential for naturalness but not necessarily for intelligibility.
Contribution
It provides a comparative analysis of neural codec models within SLM frameworks, highlighting key factors affecting speech quality and intelligibility.
Findings
Better codec reconstruction does not always improve speech generation.
High-quality decoders are crucial for natural speech production.
Speech intelligibility depends more on quantization mechanisms.
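To make "codec tokens" and "quantization mechanisms" concrete, here is a minimal sketch of one quantization scheme often used in neural audio codecs, residual vector quantization (RVQ). This summary does not say which mechanisms the paper compares, so the choice of RVQ is an assumption, and the codebooks, dimensions, and function names below are hypothetical and chosen only for illustration. Each stage picks the nearest codebook entry for whatever the previous stages left unexplained, so one frame becomes a short list of integer tokens.

```python
# Illustrative sketch only: a toy residual vector quantizer (RVQ).
# Codebooks, sizes, and function names are hypothetical, not from the paper.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen entry so the next stage sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens

def rvq_decode(codebooks, tokens):
    """Sum the selected entries from every stage to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Two tiny 2-D codebooks: a coarse stage followed by a fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
frame = [1.2, 0.9]                      # stand-in for one encoder output frame
tokens = rvq_encode(codebooks, frame)   # one token per quantizer stage
recon = rvq_decode(codebooks, tokens)   # what a decoder would start from
```

In a real codec the codebooks are learned and the reconstructed vectors feed a neural decoder. In this toy version the token list is the discrete representation an SLM would be trained to predict.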
Abstract
Neural audio codec tokens serve as the fundamental building blocks for speech language model (SLM)-based speech generation. However, there is no systematic understanding of how the codec system affects the speech generation performance of the SLM. In this work, we examine codec tokens within the SLM framework for speech generation to provide insights for effective codec design. We retrain existing high-performing neural codec models on the same dataset and with the same loss functions to compare their performance in a uniform setting. We integrate codec tokens into two SLM systems: a mask-based parallel speech generation system and an auto-regressive (AR) plus non-auto-regressive (NAR) model-based system. Our findings indicate that better speech reconstruction in codec systems does not guarantee improved speech generation in SLM. A high-quality codec decoder is crucial for natural speech production in…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing
