TL;DR
WARP-Q is a new objective speech quality metric that accurately predicts the quality of generative neural speech codecs, outperforming traditional models in correlation and ranking across various codecs and noise conditions.
Contribution
The paper introduces WARP-Q, a novel full-reference speech quality metric tailored for generative neural speech codecs, addressing limitations of existing models.
Findings
WARP-Q correlates more closely with subjective quality assessments than existing objective models.
It ranks codecs effectively and is robust to small perceptual signal changes.
WARP-Q outperforms traditional metrics like POLQA and ViSQOL.
Abstract
Good speech quality has been achieved using waveform matching and parametric reconstruction coders. Recently developed very low bit rate generative codecs can reconstruct high-quality wideband speech with bit streams of less than 3 kb/s. These codecs use a DNN with parametric input to synthesise high-quality speech outputs. Existing objective speech quality models (e.g., POLQA, ViSQOL) do not accurately predict the quality of coded speech from these generative models, underestimating quality due to signal differences not highlighted in subjective listening tests. We present WARP-Q, a full-reference objective speech quality metric that uses dynamic time warping cost for MFCC speech representations. It is robust to small perceptual signal changes. Evaluation using waveform matching, parametric and generative neural vocoder based codecs as well as channel and environmental noise shows that…
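To make the core idea in the abstract concrete, here is a minimal sketch of a dynamic time warping (DTW) cost between two feature sequences. It is not the authors' implementation: the function name `dtw_cost`, the argument names `ref` and `deg`, the Euclidean frame distance, and the plain full-sequence alignment are all assumptions chosen for illustration, and MFCC extraction is left out, so the inputs are assumed to be precomputed MFCC-like matrices of shape (frames, coefficients).

```python
import numpy as np


def dtw_cost(ref, deg):
    """Accumulated DTW alignment cost between two feature sequences.

    ref, deg: arrays of shape (n_frames, n_coeffs), e.g. MFCC frames
    (assumed to be computed elsewhere; hypothetical inputs).
    """
    ref = np.asarray(ref, dtype=float)
    deg = np.asarray(deg, dtype=float)
    n, m = len(ref), len(deg)

    # Euclidean distance between every reference frame and every degraded frame.
    dist = np.linalg.norm(ref[:, None, :] - deg[None, :, :], axis=-1)

    # acc[i, j] holds the cheapest cost of aligning the first i reference
    # frames with the first j degraded frames.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # degraded frame j matched again (ref advances)
                acc[i, j - 1],      # reference frame i matched again (deg advances)
                acc[i - 1, j - 1],  # both sequences advance
            )
    return acc[n, m]


# Toy feature sequences standing in for MFCC frames.
ref = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
stretched = np.repeat(ref, 2, axis=0)  # same content, slower timing
shifted = ref + 1.0                    # different spectral content

print(dtw_cost(ref, stretched))
print(dtw_cost(ref, shifted))
```

In this toy case the time-stretched copy aligns with zero cost while the spectrally shifted copy does not, which illustrates why an alignment-based cost can tolerate small timing differences that a frame-by-frame comparison would penalise.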
