Fewer-token Neural Speech Codec with Time-invariant Codes

Yong Ren; Tao Wang; Jiangyan Yi; Le Xu; Jianhua Tao; Chuyuan Zhang; Junzuo Zhou

arXiv:2310.00014 · cs.SD · March 12, 2024

TL;DR

This paper introduces TiCodec, a neural speech codec that uses time-invariant codes to reduce token count, improve speech reconstruction quality, and enhance zero-shot TTS performance.

Contribution

The paper proposes a neural speech codec that stores time-invariant information in a separate code and trains it with a consistency loss, reducing the token count while improving zero-shot TTS quality and similarity.

Findings

01

Fewer tokens are needed for speech representation with TiCodec.

02

Speech reconstruction quality is improved with TiCodec.

03

Zero-shot TTS performance is enhanced, with higher naturalness and lower word error rate.
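The token saving in finding 01 can be illustrated with simple arithmetic. The sketch below is not taken from the paper: the function name, the frame rate, the quantizer counts, and the number of time-invariant codes are all hypothetical values chosen only to show how moving information out of the per-frame stream into a fixed-size code shrinks the total.

```python
def tokens_per_utterance(duration_s, frame_rate_hz, n_frame_quantizers, n_invariant_codes=0):
    """Count discrete tokens for one utterance.

    Frame-level tokens grow with duration; time-invariant codes are
    emitted once per utterance, regardless of its length.
    All parameter values used below are hypothetical.
    """
    n_frames = round(duration_s * frame_rate_hz)
    return n_frames * n_frame_quantizers + n_invariant_codes


# Hypothetical baseline: two quantizers applied to every frame.
baseline = tokens_per_utterance(3.0, 75, n_frame_quantizers=2)

# Hypothetical split: one frame-level quantizer plus 8 utterance-level codes.
split = tokens_per_utterance(3.0, 75, n_frame_quantizers=1, n_invariant_codes=8)

print(baseline, split)  # 450 233
```

Because the utterance-level cost is constant, the relative saving in this toy setup grows with the length of the utterance.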

Abstract

Language-model-based text-to-speech (TTS) models, such as VALL-E, have gained attention for their outstanding in-context learning capability in zero-shot scenarios. A neural speech codec, which converts speech into discrete token representations, is a critical component of these models. However, excessively long token sequences from the codec may reduce prediction accuracy and restrict the progress of language-model-based TTS models. To address this issue, this paper proposes TiCodec, a novel neural speech codec with time-invariant codes. By encoding and quantizing time-invariant information into a separate code, TiCodec reduces the amount of frame-level information that needs encoding, effectively decreasing the number of tokens used as codes of speech. Furthermore, this paper introduces a time-invariant encoding consistency loss to enhance the consistency of time-invariant code…
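The abstract is cut off before it describes the consistency loss, so its exact form is not available on this page. As a rough illustration only, one plausible way to encourage consistency is to extract the time-invariant representation from two segments of the same utterance and penalize their dissimilarity. The cosine-based formulation and the function names below are assumptions, not the authors' confirmed definition.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consistency_loss(invariant_a, invariant_b):
    """Assumed loss: 1 - cosine similarity of two time-invariant vectors.

    invariant_a and invariant_b stand for the time-invariant representations
    of two segments of the same utterance (hypothetical inputs). The loss is
    0 when they point in the same direction and grows as they diverge.
    """
    return 1.0 - cosine_similarity(invariant_a, invariant_b)


# Toy vectors standing in for encoder outputs.
same_direction = consistency_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = consistency_loss([1.0, 0.0], [0.0, 1.0])
```

In this sketch, identical or scaled vectors give a loss near zero and orthogonal vectors give a loss of one, which matches the intuition that the time-invariant code should not change between segments of one utterance.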

Code & Models

Repositories

y-ren16/ticodec (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques