Speech Quality Embeddings for Improved Detection and Classification of Degradations in Speech Signals
Michael Kuhlmann, Tobias Cord-Landwehr, Reinhold Haeb-Umbach

TL;DR
This paper introduces a method to generate frame-level speech quality embeddings that cluster by degradation type, improving local degradation detection and classification in speech signals.
Contribution
The work extends subjective speech quality assessment (SSQA) models to produce frame-level embeddings that distinguish degradation types, using a contrastive loss and a partial mix-up strategy to enhance local quality assessment.
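To make the contrastive idea concrete, here is a minimal sketch in NumPy. It is not the authors' implementation: the summary only states that "a contrastive loss" is used, so this sketch substitutes a generic supervised contrastive (SupCon-style) loss as an assumption. The function name, the `temperature` value, and the toy embeddings are all hypothetical.

```python
import numpy as np


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """SupCon-style loss over frame embeddings (illustrative assumption).

    embeddings: (n_frames, dim) array of frame-level embeddings.
    labels:     (n_frames,) integer degradation-type label per frame.
    Frames with the same label are pulled together, all others pushed apart.
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)

    # L2-normalise so that dot products are cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    z = embeddings / np.maximum(norms, 1e-12)
    sim = z @ z.T / temperature

    n = sim.shape[0]
    self_mask = np.eye(n, dtype=bool)
    # A frame must not be contrasted with itself.
    sim = np.where(self_mask, -np.inf, sim)

    # Positives: other frames sharing the anchor's degradation label.
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    n_pos = pos.sum(axis=1)
    valid = n_pos > 0
    if not valid.any():
        return 0.0  # no anchor has a positive partner

    # Numerically stable log-softmax over all other frames.
    sim_max = sim.max(axis=1, keepdims=True)
    log_norm = np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))
    log_prob = sim - sim_max - log_norm

    # Average log-probability of the positives for each valid anchor.
    pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / n_pos[valid]).mean())


# Toy example: two tight groups of 2-D embeddings.
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_matching = supervised_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

In this toy setup the loss is lower when the labels agree with the geometric grouping of the embeddings, which is the behaviour that encourages embeddings to cluster by degradation type.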
Findings
Embedding clusters correspond to different degradation types.
The approach improves detection and classification of local degradations.
Experiments on both in-domain and out-of-domain data support these improvements.
Abstract
Automatic subjective speech quality assessment (SSQA) traditionally estimates speech quality on an utterance or system level. While this resolution was adequate for older transmission or synthesis systems that produced speech signals of mediocre quality, modern systems generate high-quality speech with degradations that may occur only locally. With suitable model architectures and regularization losses, SSQA models trained with utterance-level targets can also yield useful local predictions of speech quality. In this work, we extend such models to produce frame-level embeddings that cluster by degradation type. Specifically, we employ a partial mix-up strategy on a parallel corpus of clean and degraded utterances and apply a contrastive loss to distinguish between degradation types. Through experiments on both in- and out-of-domain data, we demonstrate that our approach improves…
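The partial mix-up step can be pictured with a short sketch. The abstract does not spell out the procedure, so this is one plausible reading, stated as an assumption: a random contiguous segment of a time-aligned degraded utterance is spliced into its clean counterpart, which gives frame-level labels marking where the degradation occurs. The function name `partial_mixup`, the segment-length range, and the frame length of 320 samples are hypothetical choices, not values from the paper.

```python
import numpy as np


def partial_mixup(clean, degraded, rng, min_frac=0.2, max_frac=0.6, frame_len=320):
    """Splice a random contiguous segment of `degraded` into `clean`.

    Assumes both signals are 1-D, time-aligned, and equally long
    (a parallel corpus). Returns the mixed signal and a per-frame
    0/1 mask where 1 marks a frame that is mostly degraded.
    """
    if clean.shape != degraded.shape:
        raise ValueError("clean and degraded must be time-aligned and equally long")

    n = clean.shape[0]
    # Draw the segment length as a fraction of the utterance, then its start.
    seg_len = int(rng.uniform(min_frac, max_frac) * n)
    start = int(rng.integers(0, n - seg_len + 1))

    mixed = clean.copy()
    mixed[start:start + seg_len] = degraded[start:start + seg_len]

    # Sample-level mask of the degraded region.
    sample_mask = np.zeros(n, dtype=bool)
    sample_mask[start:start + seg_len] = True

    # Reduce to frame level: a frame counts as degraded if at least
    # half of its samples come from the degraded signal.
    n_frames = n // frame_len
    framed = sample_mask[: n_frames * frame_len].reshape(n_frames, frame_len)
    frame_mask = (framed.mean(axis=1) >= 0.5).astype(np.int64)
    return mixed, frame_mask


# Toy example: zeros stand in for clean speech, ones for the degraded version.
rng = np.random.default_rng(0)
clean = np.zeros(16000)
degraded = np.ones(16000)
mixed, frame_mask = partial_mixup(clean, degraded, rng)
```

The frame mask could then serve as a local target, for example by multiplying it with a degradation-type index to obtain per-frame labels for a contrastive objective. That pairing is also an assumption rather than something the abstract confirms.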
