Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation

Haohan Guo; Fenglong Xie; Dongchao Yang; Xixin Wu; Helen Meng

arXiv:2409.11630 · cs.SD · September 19, 2024

TL;DR

This paper introduces CoFi-Speech, a multi-scale neural codec language model for TTS that improves naturalness and speaker similarity by encoding speech at multiple temporal resolutions and employing coarse-to-fine generation strategies.

Contribution

The paper proposes CoFi-Speech, a multi-scale speech coding and generation framework that improves neural codec language model TTS by countering recency bias and modeling speech from coarse to fine.

Findings

01 Outperforms single-scale baselines in naturalness and speaker similarity.

02 Effectively learns multi-scale speech representations with high-quality reconstruction.

03 Stack-of-scale generation significantly improves speech quality.

Abstract

The neural codec language model (CLM) has demonstrated remarkable performance in text-to-speech (TTS) synthesis. However, troubled by "recency bias", the CLM lacks sufficient attention to coarse-grained information at a higher temporal scale, often producing unnatural or even unintelligible speech. This work proposes CoFi-Speech, a coarse-to-fine CLM-TTS approach that employs multi-scale speech coding and generation to address this issue. We train a multi-scale neural codec, CoFi-Codec, to encode speech into a multi-scale discrete representation comprising multiple token sequences with different time resolutions. Then, we propose CoFi-LM, which can generate this representation in two modes: the single-LM-based chain-of-scale generation and the multiple-LM-based stack-of-scale generation. In experiments, CoFi-Speech significantly outperforms single-scale baseline systems on naturalness and…
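The two generation modes named in the abstract can be pictured with a toy sketch. It is not the paper's CoFi-Codec or CoFi-LM: the average-pooling "encoder", the uniform quantizer, the strides (4, 2, 1), and every function name below are assumptions made for illustration only. The sketch only shows how one utterance could yield several token sequences at different time resolutions, how a single model might see them as one coarse-to-fine sequence (chain-of-scale), and how separate per-scale models might each receive the coarser scales as context (stack-of-scale).

```python
from typing import Dict, List, Tuple


def downsample(frames: List[float], stride: int) -> List[float]:
    """Average-pool non-overlapping windows of `stride` frames (toy encoder)."""
    pooled = []
    for start in range(0, len(frames), stride):
        window = frames[start:start + stride]
        pooled.append(sum(window) / len(window))
    return pooled


def quantize(values: List[float], levels: int = 8) -> List[int]:
    """Map values in [0, 1] to integer token ids in [0, levels - 1]."""
    return [min(levels - 1, max(0, int(v * levels))) for v in values]


def multi_scale_tokens(
    frames: List[float], strides: Tuple[int, ...] = (4, 2, 1)
) -> Dict[int, List[int]]:
    """One token sequence per stride; a larger stride gives a coarser scale."""
    return {s: quantize(downsample(frames, s)) for s in strides}


def chain_of_scale(tokens: Dict[int, List[int]]) -> List[int]:
    """Single-model view: concatenate all scales, coarsest first."""
    sequence: List[int] = []
    for stride in sorted(tokens, reverse=True):
        sequence.extend(tokens[stride])
    return sequence


def stack_of_scale(
    tokens: Dict[int, List[int]]
) -> List[Tuple[List[List[int]], List[int]]]:
    """Multi-model view: (coarser context, target) per scale.

    Assumption: each hypothetical per-scale model is conditioned on all
    coarser scales. The abstract does not specify the conditioning scheme.
    """
    pairs = []
    context: List[List[int]] = []
    for stride in sorted(tokens, reverse=True):
        pairs.append((list(context), tokens[stride]))
        context.append(tokens[stride])
    return pairs


# Eight toy "frames" standing in for frame-level speech features.
frames = [0.0, 0.25, 0.5, 0.75, 1.0, 0.5, 0.25, 0.0]
tokens = multi_scale_tokens(frames)
chain = chain_of_scale(tokens)
stack = stack_of_scale(tokens)
```

In this toy setup the coarsest scale holds two tokens for the eight frames, so it sits at the very start of the chained sequence, while the stacked view hands each finer scale the coarser sequences as explicit context.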

Taxonomy

Topics: Speech Recognition and Synthesis