Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation

Jiaqi Li; Dongmei Wang; Xiaofei Wang; Yao Qian; Long Zhou; Shujie Liu; Midia Yousefi; Canrun Li; Chung-Hsien Tsai; Zhen Xiao; Yanqing Liu; Junkun Chen; Sheng Zhao; Jinyu Li; Zhizheng Wu; Michael Zeng

arXiv:2409.04016 · cs.SD · September 9, 2024

Open Access

TL;DR

This paper systematically studies how neural audio codec tokens influence speech generation in speech language models. It finds that a high-quality codec decoder is essential for naturalness, while intelligibility depends more on the quantization mechanism.

Contribution

It provides a comparative analysis of neural codec models within SLM frameworks, highlighting key factors affecting speech quality and intelligibility.

Findings

01. Better codec reconstruction does not always improve speech generation.

02. High-quality decoders are crucial for natural speech production.

03. Speech intelligibility depends more on quantization mechanisms.

Abstract

Neural audio codec tokens serve as the fundamental building blocks for speech language model (SLM)-based speech generation. However, there is no systematic understanding of how the codec system affects the speech generation performance of the SLM. In this work, we examine codec tokens within the SLM framework for speech generation to provide insights for effective codec design. We retrain existing high-performing neural codec models with the same dataset and loss functions to compare their performance in a uniform setting. We integrate codec tokens into two SLM systems: a mask-based parallel speech generation system and an auto-regressive (AR) plus non-auto-regressive (NAR) model-based system. Our findings indicate that better speech reconstruction in codec systems does not guarantee improved speech generation in the SLM. A high-quality codec decoder is crucial for natural speech production in…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing