Self-Supervised Speech Quality Assessment (S3QA): Leveraging Speech Foundation Models for a Scalable Speech Quality Metric

Mattson Ogg; Caitlyn Bishop; Han Yi; Sarah Robinson

arXiv:2506.01655 · eess.AS · October 9, 2025

Open Access

TL;DR

This paper introduces S3QA, a scalable self-supervised speech quality assessment model leveraging foundation models, which accurately predicts speech degradation across diverse acoustic challenges without relying on human ratings.

Contribution

The paper presents a novel self-supervised approach using WavLM and transformer models to assess speech quality, eliminating the need for labor-intensive human ratings and improving generalization across conditions.

Findings

01 Accurately predicts speech degradation in diverse acoustic environments.

02 Aligns well with human ratings and speech recognition performance.

03 Demonstrates robustness across multiple unseen datasets.

Abstract

Methods for automatically assessing speech quality in real-world environments are critical for developing robust human language technologies and assistive devices. Behavioral ratings provided by human raters (e.g., mean opinion scores; MOS) are considered the gold standard, but they are susceptible to variability between individual raters, cannot easily be generalized across corpora, and are labor-intensive to collect, which limits the range of acoustic challenges they can quantify. Here, we present a new, scalable method for automatically assessing speech quality: the self-supervised speech quality assessment (S3QA) model. First, we manipulated high-quality utterances from multiple speech corpora, using a wide range of acoustic challenges intended to emulate common sources of quality degradation in the real world: frequency filtering, reverberation, background noise, and digital compression.…
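To make the degradation step more concrete, here is a minimal sketch of three of the four manipulations the abstract names (frequency filtering, reverberation, and background noise), written with NumPy only. This is not the authors' pipeline: the function names, the synthetic sine-wave "utterance", the white-noise background, the exponentially decaying impulse response, and all parameter values are assumptions chosen for illustration. Digital compression is omitted because it would need an external codec.

```python
import numpy as np


def add_noise_at_snr(clean, noise, snr_db):
    """Mix `noise` into `clean` so the result has the requested SNR in dB."""
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve clean_power / (scale**2 * noise_power) = 10 ** (snr_db / 10).
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise


def lowpass_fft(clean, sample_rate, cutoff_hz):
    """Crude low-pass filter: zero every FFT bin above `cutoff_hz`."""
    spectrum = np.fft.rfft(clean)
    freqs = np.fft.rfftfreq(len(clean), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(clean))


def add_reverb(clean, sample_rate, rt60_s, rng):
    """Convolve with a synthetic impulse response (assumption: decaying noise)."""
    n_taps = int(rt60_s * sample_rate)
    t = np.arange(n_taps) / sample_rate
    # Amplitude envelope falls by 60 dB over rt60_s seconds.
    impulse = rng.standard_normal(n_taps) * 10 ** (-3.0 * t / rt60_s)
    impulse /= np.sqrt(np.sum(impulse ** 2))  # unit-energy impulse response
    return np.convolve(clean, impulse)[: len(clean)]


# Hypothetical stand-in for a clean utterance: one second of a 220 Hz tone.
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = 0.5 * np.sin(2 * np.pi * 220 * t)
noise = rng.standard_normal(sample_rate)  # white noise as the "background"

noisy = add_noise_at_snr(clean, noise, snr_db=10.0)
filtered = lowpass_fft(clean, sample_rate, cutoff_hz=4000.0)
reverberant = add_reverb(clean, sample_rate, rt60_s=0.3, rng=rng)

# Check that the mixture really sits at the requested SNR.
measured_snr_db = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

Because each degradation is generated from a known clean signal, the severity of the manipulation (here, the SNR or the cutoff frequency) is available without any human rating, which is the property a self-supervised quality metric can exploit.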

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques