High Fidelity Neural Audio Compression
Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi

TL;DR
This paper presents a neural audio codec that achieves high-fidelity, real-time compression across various audio types, utilizing a novel training stabilization method and lightweight Transformers for further compression.
Contribution
It introduces a new neural audio codec with a stable training mechanism and demonstrates effective compression with lightweight Transformers, outperforming existing methods.
Findings
Superior audio quality across multiple domains and bandwidths
Up to 40% additional compression with lightweight Transformers, while staying faster than real time
Stable training achieved through a novel loss balancer mechanism
Abstract
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks. It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion. We simplify and speed up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model including: training objective,…
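The loss balancer idea in the abstract (a loss weight sets the fraction of the overall gradient, not a multiplier on the raw loss) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `balance_gradients`, the `total_norm` and `eps` parameters, and the use of the instantaneous gradient norm (rather than any smoothed estimate) are assumptions made for illustration only.

```python
import math


def balance_gradients(grads, weights, total_norm=1.0, eps=1e-12):
    """Combine per-loss gradients so each one contributes a fixed share.

    grads:   dict mapping a loss name to its gradient (list of floats),
             all taken with respect to the same model output.
    weights: dict mapping a loss name to its weight; only the ratio
             weight / sum(weights) matters.
    Hypothetical helper: normalizes each gradient by its own L2 norm,
    then scales it by its share of `total_norm`.
    """
    weight_sum = sum(weights[name] for name in grads)
    dim = len(next(iter(grads.values())))
    balanced = [0.0] * dim
    for name, g in grads.items():
        norm = math.sqrt(sum(x * x for x in g))
        # Dividing by the norm removes the loss's natural scale.
        scale = total_norm * (weights[name] / weight_sum) / (norm + eps)
        for i, x in enumerate(g):
            balanced[i] += scale * x
    return balanced


# Two losses whose raw gradients differ by six orders of magnitude.
grads = {"reconstruction": [1000.0, 0.0], "adversarial": [0.0, 0.001]}
weights = {"reconstruction": 1.0, "adversarial": 3.0}
balanced = balance_gradients(grads, weights)
```

In this toy case the combined gradient is roughly `[0.25, 0.75]`: the shares follow the 1:3 weights even though the raw gradient scales differ enormously, which is the decoupling the abstract describes.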
Peer Reviews
No public reviews on file for this paper yet.
Code & Models
- 🤗 WhisperSpeech/WhisperSpeech · model · ♡ 250
- 🤗 facebook/encodec_24khz · model · 108k dl · ♡ 54
- 🤗 facebook/encodec_48khz · model · 15k dl · ♡ 35
- 🤗 alibaba-damo/audio_codec-encodec-en-libritts-16k-nq32ds320-pytorch · model · 8 dl · ♡ 1
- 🤗 alibaba-damo/audio_codec-encodec-en-libritts-16k-nq32ds640-pytorch · model · 7 dl · ♡ 1
- 🤗 alibaba-damo/audio_codec-encodec-zh_en-general-16k-nq32ds320-pytorch · model · 6 dl · ♡ 1
- 🤗 alibaba-damo/audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch · model · 4 dl · ♡ 3
- 🤗 alibaba-damo/audio_codec-freqcodec_magphase-en-libritts-16k-gr1nq32ds320-pytorch · model · 3 dl
- 🤗 alibaba-damo/audio_codec-freqcodec_magphase-en-libritts-16k-gr8nq32ds320-pytorch · model · 4 dl · ♡ 1
- 🤗 fierce-cats/beatrice-trainer · model · ♡ 39
Videos
High Fidelity Neural Audio Compression | Paper & Code Explained · YouTube
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Image and Signal Denoising Methods
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Adam · Label Smoothing · Absolute Position Encodings · Layer Normalization
