High Quality Audio Coding with MDCTNet
Grant Davidson, Mark Vinton, Per Ekstrand, Cong Zhou, Lars Villemoes, and Lie Lu

TL;DR
This paper introduces MDCTNet, a neural audio generative model that operates in the perceptually weighted MDCT domain and captures correlations in both time and frequency with recurrent layers (RNNs). Used as an audio codec, it delivers high-quality audio at low bitrates.
Contribution
The paper presents MDCTNet, a novel neural model for audio coding that uses perceptual weighting and RNNs to improve compression quality at low bitrates.
Findings
At 24 kb/s variable bitrate (VBR), achieves mean subjective quality similar to that of Opus at twice the bitrate.
Trained on a diverse set of fullband monophonic audio signals sampled at 48 kHz.
Uses a generative model conditioned by a perceptual audio encoder.
Abstract
We propose a neural audio generative model, MDCTNet, operating in the perceptually weighted domain of an adaptive modified discrete cosine transform (MDCT). The architecture of the model captures correlations in both time and frequency directions with recurrent layers (RNNs). An audio coding system is obtained by training MDCTNet on a diverse set of fullband monophonic audio signals at 48 kHz sampling, conditioned by a perceptual audio encoder. In a subjective listening test with ten excerpts chosen to be balanced across content types, yet stressful for both codecs, the mean performance of the proposed system for 24 kb/s variable bitrate (VBR) is similar to that of Opus at twice the bitrate.
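For readers unfamiliar with the transform the abstract refers to, the following is a minimal sketch of a plain MDCT analysis/synthesis pair in NumPy. It is not the paper's method: it assumes a fixed, even frame size N and a sine window, and it omits the adaptive block switching, the perceptual weighting, and the neural model entirely. The function names (`mdct_basis`, `sine_window`, `mdct`, `imdct`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def mdct_basis(N):
    """Cosine basis of shape (N, 2N) for a frame of 2N samples (illustrative helper)."""
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * np.outer(k + 0.5, n + 0.5 + N / 2))


def sine_window(N):
    """Sine window of length 2N; satisfies w[n]^2 + w[n+N]^2 = 1 (Princen-Bradley)."""
    return np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))


def mdct(x, N):
    """Windowed MDCT with 50% overlap. Assumes len(x) is a multiple of N and N is even.

    Returns an array of shape (len(x) // N + 1, N).
    """
    w = sine_window(N)
    C = mdct_basis(N)
    # Pad one hop of zeros on each side so every input block is covered by two frames.
    padded = np.concatenate([np.zeros(N), x, np.zeros(N)])
    num_frames = len(padded) // N - 1
    frames = np.stack([padded[i * N:(i + 2) * N] for i in range(num_frames)])
    return (frames * w) @ C.T


def imdct(X, N):
    """Inverse MDCT with synthesis window and overlap-add; undoes mdct() above."""
    w = sine_window(N)
    C = mdct_basis(N)
    # The 2/N scale is the one that matches a window applied at both analysis and synthesis.
    frames = (2.0 / N) * (X @ C) * w
    out = np.zeros((X.shape[0] + 1) * N)
    for i, frame in enumerate(frames):
        out[i * N:(i + 2) * N] += frame  # time-domain aliasing cancels in the overlap
    return out[N:-N]  # drop the zero padding added in mdct()
```

Each frame of 2N samples yields only N coefficients, so the transform is critically sampled, yet overlap-adding neighbouring frames reconstructs the input. A generative model working in this domain would predict or sample the coefficient array `X` rather than waveform samples.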
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Adaptive Filtering Techniques
Methods: Test · Discrete Cosine Transform
