Evaluating Objective Speech Quality Metrics for Neural Audio Codecs
Luca A. Lanzendörfer, Florian Grötschla

TL;DR
This paper evaluates the effectiveness of existing objective speech quality metrics in assessing neural audio codecs, comparing them to human listening tests to identify which metrics reliably reflect perceived audio quality.
Contribution
It provides an empirical analysis of objective metrics' correlation with human perception for neural audio codecs, offering guidance for future evaluations.
Findings
Some metrics correlate well with human perception
Certain metrics fail to capture relevant distortions
The results offer practical guidance for selecting evaluation metrics for neural speech codecs
Abstract
Neural audio codecs have gained recent popularity for their use in generative modeling as they offer high-fidelity audio reconstruction at low bitrates. While human listening studies remain the gold standard for assessing perceptual quality, they are time-consuming and impractical. In this work, we examine the reliability of existing objective quality metrics in assessing the performance of recent neural audio codecs. To this end, we conduct a MUSHRA listening test on high-fidelity speech signals and analyze the correlation between subjective scores and widely used objective metrics. Our results show that, while some metrics align well with human perception, others struggle to capture relevant distortions. Our findings provide practical guidance for selecting appropriate evaluation metrics when using neural audio codecs for speech.
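The correlation analysis described in the abstract can be illustrated with a minimal sketch. The abstract does not say which correlation statistics the authors use, so Pearson and Spearman coefficients are assumed here as common choices. All scores and the names `metric_a` and `metric_b` are made-up placeholders, not results from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical mean MUSHRA scores (0-100) for five codec conditions.
mushra = [92.0, 78.0, 61.0, 45.0, 30.0]

# Hypothetical objective metric scores for the same five conditions.
metric_a = [4.4, 3.9, 3.1, 2.6, 1.8]       # orders conditions like listeners do
metric_b = [0.71, 0.74, 0.69, 0.73, 0.70]  # barely tracks the listener ranking

for name, scores in [("metric_a", metric_a), ("metric_b", metric_b)]:
    print(name, round(pearson(mushra, scores), 3), round(spearman(mushra, scores), 3))
```

In this toy setup, a metric that ranks the conditions in the same order as the listeners gets a rank correlation near 1, while a metric that misses the relevant distortions scores much lower. In practice, library routines such as `scipy.stats.pearsonr` and `scipy.stats.spearmanr` would typically replace the hand-written helpers.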
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Speech Recognition and Synthesis
