High-Fidelity Audio Compression with Improved RVQGAN
Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, Kundan Kumar

TL;DR
This paper presents a universal neural audio compression method that achieves high fidelity and significant compression ratios across various audio domains, utilizing advanced vector quantization and adversarial training techniques.
Contribution
Introduces a universal neural audio compression algorithm combining high-fidelity generation with improved vector quantization, outperforming existing methods across all audio types.
Findings
Achieves ~90x compression of 44.1 kHz audio into discrete tokens at 8 kbps.
Outperforms the competing audio compression algorithms it is compared against.
Provides comprehensive ablation studies and open-source resources.
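The ~90x figure can be sanity-checked with simple arithmetic. The sketch below assumes the uncompressed reference is 16-bit mono PCM, which this summary does not state; the function names are illustrative only.

```python
def pcm_bitrate_kbps(sample_rate_hz, bit_depth=16, channels=1):
    """Bitrate of uncompressed PCM audio in kbps (1 kbps = 1000 bits/s)."""
    return sample_rate_hz * bit_depth * channels / 1000.0


def compression_ratio(sample_rate_hz, codec_kbps, bit_depth=16, channels=1):
    """Ratio of the raw PCM bitrate to the codec bitrate."""
    return pcm_bitrate_kbps(sample_rate_hz, bit_depth, channels) / codec_kbps


# Assumption: 16-bit mono PCM as the uncompressed baseline.
raw_kbps = pcm_bitrate_kbps(44_100)
ratio = compression_ratio(44_100, codec_kbps=8.0)
print(f"raw PCM: {raw_kbps:.1f} kbps, ratio at 8 kbps: {ratio:.1f}x")
```

Under that assumption the raw stream is 705.6 kbps and the ratio is about 88x, consistent with the rounded "~90x" claim. A stereo or 24-bit baseline would give a larger ratio.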
Abstract
Language models have been successfully used to model natural signals, such as images, speech, and music. A key component of these models is a high quality neural compression model that can compress high-dimensional natural signals into lower dimensional discrete tokens. To that end, we introduce a high-fidelity universal neural audio compression algorithm that achieves ~90x compression of 44.1 KHz audio into tokens at just 8kbps bandwidth. We achieve this by combining advances in high-fidelity audio generation with better vector quantization techniques from the image domain, along with improved adversarial and reconstruction losses. We compress all domains (speech, environment, music, etc.) with a single universal model, making it widely applicable to generative modeling of all audio. We compare with competing audio compression algorithms, and find our method outperforms them…
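The "RVQ" in the title conventionally refers to residual vector quantization, the step that turns a continuous latent vector into a short list of discrete tokens. The following is a minimal pure-Python sketch of generic residual vector quantization, not the authors' implementation: the codebooks here are random rather than learned, and the sizes `DIM`, `K`, and `N_Q` are hypothetical toy values.

```python
import math
import random


def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec in stages; each stage quantizes the previous stage's residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        k = nearest(residual, codebook)
        codes.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from each stage's codebook."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Hypothetical toy configuration; random codebooks stand in for learned ones.
rng = random.Random(0)
DIM, K, N_Q = 4, 16, 3
codebooks = [
    [[rng.gauss(0, 1.0 / 2 ** q) for _ in range(DIM)] for _ in range(K)]
    for q in range(N_Q)
]

x = [rng.gauss(0, 1) for _ in range(DIM)]
codes = rvq_encode(x, codebooks)      # one integer token per stage
x_hat = rvq_decode(codes, codebooks)  # approximate reconstruction of x
bits_per_vector = N_Q * int(math.log2(K))
print(codes, bits_per_vector)
```

Each latent vector costs `N_Q * log2(K)` bits, so the bitrate of such a codec is that number multiplied by the number of latent vectors per second. Decoding with only the first few stages gives a coarser reconstruction at a lower bitrate.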
Code & Models
- 🤗 taresh18/nano-codec · model · 10 dl · ♡ 1
- 🤗 descript/descript-audio-codec · model · ♡ 18
- 🤗 parler-tts/dac_44khZ_8kbps · model · 50 dl · ♡ 19
- 🤗 sarulab-speech/UTDUSS-Vocoder · model · ♡ 2
- 🤗 pharaouk/dac_44khZ_8kbps · model · 4 dl
- 🤗 fierce-cats/beatrice-trainer · model · ♡ 39
- 🤗 nvidia/low-frame-rate-speech-codec-22khz · model · 172 dl · ♡ 19
- 🤗 Blinorot/dac_finetuned_librispeech · model
- 🤗 nvidia/nemo-nano-codec-22khz-1.78kbps-12.5fps · model · 2.4k dl · ♡ 10
- 🤗 nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps · model · 3.5k dl · ♡ 10
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies
