TL;DR
This paper presents a discrete speech tokenizer, fine-tuned from an open-source RVQGAN model, that achieves high-quality speech reconstruction at a low token rate of 150-300 tokens per second (1500-3000 bps), making it suitable for efficient speech compression.
Contribution
It introduces a low-bitrate speech tokenizer obtained by fine-tuning an open-source general-audio RVQGAN model on diverse open-source speech data, enabling reconstruction that is nearly indistinguishable from PCM at significantly reduced token rates.
Findings
Achieves wideband (24kHz) speech reconstruction that is nearly indistinguishable from PCM.
Operates at 150-300 tokens per second (1500-3000 bps).
Evaluated on English speech data covering different recording conditions, including studio settings.
Abstract
Discrete audio codecs (or audio tokenizers) have recently regained interest due to the ability of Large Language Models (LLMs) to learn their compressed acoustic representations. Various publicly available trainable discrete tokenizers have recently demonstrated impressive results for audio tokenization, yet they mostly require high token rates to achieve high-quality reconstruction. In this study, we fine-tuned an open-source general audio RVQGAN model using diverse open-source speech data, considering various recording conditions and quality levels. The resulting wideband (24kHz) speech-only model achieves speech reconstruction that is nearly indistinguishable from PCM (pulse-code modulation) at a rate of 150-300 tokens per second (1500-3000 bps). The evaluation used comprehensive English speech data encompassing different recording conditions, including studio settings. Speech samples…
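The token rates and bitrates quoted in the abstract imply 10 bits per token (1500 bps / 150 tokens per second), which would correspond to a codebook of 1024 entries. The sketch below works through that arithmetic and compares it with uncompressed PCM. The codebook size and the 16-bit mono PCM reference are assumptions made for illustration and are not stated in the abstract; the function names are hypothetical.

```python
import math


def bits_per_token(codebook_size: int) -> float:
    """Bits needed to index one entry of a codebook of the given size."""
    return math.log2(codebook_size)


def token_bitrate_bps(tokens_per_second: float, codebook_size: int) -> float:
    """Bitrate of a token stream: tokens per second times bits per token."""
    return tokens_per_second * bits_per_token(codebook_size)


def pcm_bitrate_bps(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Bitrate of uncompressed PCM audio."""
    return sample_rate_hz * bit_depth * channels


# Assumption: 1024-entry codebooks (10 bits per token), inferred from
# 1500 bps / 150 tokens per second in the abstract.
CODEBOOK_SIZE = 1024

# Assumption: the PCM reference is 16-bit mono at the stated 24 kHz.
PCM_BPS = pcm_bitrate_bps(sample_rate_hz=24_000, bit_depth=16)

for rate in (150, 300):
    bps = token_bitrate_bps(rate, CODEBOOK_SIZE)
    ratio = PCM_BPS / bps
    print(f"{rate} tokens/s -> {bps:.0f} bps, {ratio:.0f}x smaller than PCM")
```

Under these assumptions, the tokenized stream would be roughly two orders of magnitude smaller than the raw waveform, which is what makes such token rates attractive for LLM-based speech modeling.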
