Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation

Wenrui Liu; Qian Chen; Wen Wang; Yafeng Chen; Jin Xu; Zhifang Guo; Guanrou Yang; Weiqin Li; Xiaoda Yang; Tao Jin; Minghui Fang; Jialong Zuo; Bai Jionghao; Zemin Liu

arXiv:2505.24496 · eess.AS · June 2, 2025 · ACM Multimedia

Open Access

TL;DR

This paper introduces a compressed-to-fine language modeling approach that improves speech generation by effectively handling long sequences of speech tokens through local retention and long-range compression.

Contribution

It proposes a novel method that combines local token retention with compression of long-range context to enhance neural codec language models for speech generation.

Findings

01 Improved speech generation quality across various neural audio codecs.

02 Effective reduction of redundant information in long speech token sequences.

03 Enhanced model performance in downstream speech tasks.

Abstract

Neural audio codecs, used as speech tokenizers, have demonstrated remarkable potential in the field of speech generation. However, to ensure high-fidelity audio reconstruction, neural audio codecs typically encode audio into long sequences of speech tokens, posing a significant challenge for downstream language models in long-context modeling. We observe that speech token sequences exhibit short-range dependency: due to the monotonic alignment between text and speech in text-to-speech (TTS) tasks, the prediction of the current token primarily relies on its local context, while long-range tokens contribute less to the current token prediction and often contain redundant information. Inspired by this observation, we propose a compressed-to-fine language modeling approach to address the challenge of long sequence speech tokens within neural codec language models: (1)…
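The abstract is cut off before the method's steps are listed, so the following is only an illustrative sketch of the general idea stated above: keep the most recent tokens at full resolution and replace the distant context with fewer, compressed entries. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `compressed_to_fine_context`, the parameters `local_window` and `span_size`, and the use of mean pooling over fixed-size spans as the compression step.

```python
from typing import List, Sequence


def compressed_to_fine_context(
    embeddings: Sequence[Sequence[float]],
    local_window: int,
    span_size: int,
) -> List[List[float]]:
    """Shorten a context: pool distant frames, keep recent frames unchanged.

    Hypothetical helper. Mean pooling over fixed spans is an assumed stand-in
    for whatever compression the paper actually uses.
    """
    if local_window < 0 or span_size < 1:
        raise ValueError("local_window must be >= 0 and span_size must be >= 1")

    # Everything before `split` is long-range context; the rest is local.
    split = max(0, len(embeddings) - local_window)
    distant, local = embeddings[:split], embeddings[split:]

    compressed: List[List[float]] = []
    for start in range(0, len(distant), span_size):
        span = distant[start:start + span_size]
        dim = len(span[0])
        # One averaged vector stands in for the whole span (assumption).
        compressed.append(
            [sum(vec[d] for vec in span) / len(span) for d in range(dim)]
        )

    # Compressed long-range entries first, then the fine-grained local tokens.
    return compressed + [list(vec) for vec in local]


if __name__ == "__main__":
    # Ten toy 2-dimensional "token embeddings".
    toy = [[float(i), 0.0] for i in range(10)]
    context = compressed_to_fine_context(toy, local_window=4, span_size=3)
    print(len(toy), "->", len(context))
```

In this toy setting the context length depends on the number of spans plus the local window rather than on the full sequence length, which is the kind of saving the short-range dependency observation would motivate.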

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing