SNAC: Multi-Scale Neural Audio Codec
Hubert Siuzdak, Florian Grötschla, Luca A. Lanzendörfer

TL;DR
This paper introduces the Multi-Scale Neural Audio Codec (SNAC), an extension of Residual Vector Quantization (RVQ) in which the quantizers operate at multiple temporal resolutions. Objective and subjective evaluations show that this yields more efficient audio compression.
Contribution
It proposes a hierarchical quantization approach that adapts to audio structure across multiple timescales, enhancing neural audio compression.
Findings
Achieves better compression efficiency than standard RVQ
Demonstrates high-fidelity audio reconstruction at low bitrates
Open-sourced code and models for reproducibility
Abstract
Neural audio codecs have recently gained popularity because they can represent audio signals with high fidelity at very low bitrates, making it feasible to use language modeling approaches for audio generation and understanding. Residual Vector Quantization (RVQ) has become the standard technique for neural audio compression using a cascade of VQ codebooks. This paper proposes the Multi-Scale Neural Audio Codec, a simple extension of RVQ where the quantizers can operate at different temporal resolutions. By applying a hierarchy of quantizers at variable frame rates, the codec adapts to the audio structure across multiple timescales. This leads to more efficient compression, as demonstrated by extensive objective and subjective evaluations. The code and model weights are open-sourced at https://github.com/hubertsiuzdak/snac.
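The idea of residual quantizers running at different frame rates can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the random codebooks, the strides, and the choice of average pooling for downsampling and frame repetition for upsampling are all assumptions made for illustration. The official code is in the repository linked above.

```python
import numpy as np


def nearest_code(x, codebook):
    """Return the index of the closest codebook vector for each row of x."""
    # x: (n, d), codebook: (k, d) -> squared distances of shape (n, k)
    dist = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1)


def multiscale_rvq(z, codebooks, strides):
    """Hypothetical multi-scale RVQ over latent frames z of shape (T, D).

    Each level quantizes the current residual at a coarser frame rate
    (one code per `stride` frames), then subtracts the upsampled result.
    """
    T, D = z.shape
    residual = z.copy()
    recon = np.zeros_like(z)
    codes = []
    for codebook, stride in zip(codebooks, strides):
        assert T % stride == 0, "T must be divisible by every stride"
        # Downsample: average each group of `stride` frames (assumption).
        pooled = residual.reshape(T // stride, stride, D).mean(axis=1)
        idx = nearest_code(pooled, codebook)
        # Upsample: repeat each quantized frame `stride` times (assumption).
        upsampled = np.repeat(codebook[idx], stride, axis=0)
        recon += upsampled
        residual -= upsampled
        codes.append(idx)
    return codes, recon


rng = np.random.default_rng(0)
T, D, K = 16, 8, 32          # frames, latent dimension, codebook size
strides = [4, 2, 1]          # coarse-to-fine temporal resolutions
z = rng.normal(size=(T, D))
codebooks = [rng.normal(size=(K, D)) for _ in strides]

codes, recon = multiscale_rvq(z, codebooks, strides)
code_lengths = [len(c) for c in codes]   # one entry per level
total_tokens = sum(code_lengths)         # tokens for the whole clip
```

With these strides the three levels emit 4, 8, and 16 codes for 16 latent frames, 28 tokens in total, whereas three quantizers all running at the full frame rate would emit 48. Fewer tokens for the same number of codebooks is the source of the bitrate saving in this toy setting; whether quality holds up depends on trained codebooks, which this sketch does not model.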
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Neural Networks and Applications
