Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer

Slava Shechtman; Avihu Dekel

arXiv:2410.08325 · eess.AS · October 14, 2024 · Interspeech

TL;DR

This paper presents a discrete speech tokenizer, fine-tuned from an open-source general-audio RVQGAN model, that reconstructs wideband (24 kHz) speech at high quality using only 150-300 tokens per second (1500-3000 bps).

Contribution

It introduces a low-bitrate, speech-only tokenizer obtained by fine-tuning an RVQGAN model on diverse open-source speech data covering various recording conditions and quality levels. The resulting reconstruction is nearly indistinguishable from PCM at 150-300 tokens per second.

Findings

01 Achieves speech reconstruction that is nearly indistinguishable from PCM.

02 Operates at 150-300 tokens per second (1500-3000 bps).

03 Is evaluated on English speech across different recording conditions, including studio settings.

Abstract

Discrete audio codecs (or audio tokenizers) have recently regained interest due to the ability of Large Language Models (LLMs) to learn their compressed acoustic representations. Various publicly available trainable discrete tokenizers have recently demonstrated impressive results for audio tokenization, yet they mostly require high token rates to achieve high-quality reconstruction. In this study, we fine-tuned an open-source general audio RVQGAN model using diverse open-source speech data, considering various recording conditions and quality levels. The resulting wideband (24 kHz) speech-only model achieves speech reconstruction that is nearly indistinguishable from PCM (pulse-code modulation) at a rate of 150-300 tokens per second (1500-3000 bps). The evaluation used comprehensive English speech data encompassing different recording conditions, including studio settings. Speech samples…
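The token rates and bitrates quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is not taken from the paper: the function names are hypothetical, and the PCM reference of 16-bit mono is an assumption (the abstract gives the 24 kHz sample rate but not the bit depth). It derives the bits carried per token, the codebook size that figure would imply, and the compression ratio against that assumed PCM stream.

```python
# Back-of-the-envelope check of the rates quoted in the abstract.
# Assumption (not stated in the source): the PCM reference is 16-bit mono.
PCM_SAMPLE_RATE_HZ = 24_000   # wideband rate given in the abstract
PCM_BITS_PER_SAMPLE = 16      # assumed bit depth


def bits_per_token(bitrate_bps: float, tokens_per_second: float) -> float:
    """Bits carried by each token at a given bitrate and token rate."""
    return bitrate_bps / tokens_per_second


def implied_codebook_size(bits: float) -> int:
    """Number of codebook entries addressable with `bits` bits per token."""
    return 2 ** round(bits)


def compression_ratio_vs_pcm(bitrate_bps: float) -> float:
    """Ratio of the assumed PCM bitrate to the tokenizer bitrate."""
    return PCM_SAMPLE_RATE_HZ * PCM_BITS_PER_SAMPLE / bitrate_bps


# The two operating points from the abstract: (bps, tokens per second).
for bps, tps in [(1500, 150), (3000, 300)]:
    bits = bits_per_token(bps, tps)
    print(
        f"{tps} tokens/s at {bps} bps: {bits:.0f} bits/token, "
        f"codebook size {implied_codebook_size(bits)}, "
        f"{compression_ratio_vs_pcm(bps):.0f}x smaller than 16-bit PCM"
    )
```

Both operating points work out to 10 bits per token, which is consistent with codebooks of 1024 entries. How the token rate splits between frame rate and number of residual codebooks is not stated in the abstract, so the sketch leaves that open.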

Code & Models

Repositories

descriptinc/descript-audio-codec (PyTorch, official)
