SA-SSL-MOS: Self-supervised Learning MOS Prediction with Spectral Augmentation for Generalized Multi-Rate Speech Assessment

Fengyuan Cao; Xinyu Liang; Fredrik Cumlin; Victor Ungureanu; Chandan K. A. Reddy; Christian Schuldt; Saikat Chatterjee

arXiv:2602.14785·eess.AS·February 17, 2026

TL;DR

This paper introduces a spectral augmentation method with a two-step training scheme for speech quality assessment, effectively leveraging high-frequency information to improve MOS prediction across multiple sampling rates.

Contribution

It proposes a novel SSL-based approach with spectral augmentation and a two-step training process to enhance multi-rate speech quality prediction.

Findings

01. High-frequency features are crucial for accurate multi-rate SQA.

02. Two-step training improves generalization on limited multi-rate data.

03. Spectral augmentation enhances SSL model performance.

Abstract

Designing a speech quality assessment (SQA) system that estimates the mean opinion score (MOS) of multi-rate speech with varying sampling frequency (16-48 kHz) is a challenging task. The challenge arises from the limited availability of MOS-labeled training data comprising multi-rate speech samples. While self-supervised learning (SSL) models have been widely adopted in SQA to boost performance, a key limitation is that they are pretrained on 16 kHz speech and therefore discard the high-frequency information present at higher sampling rates. To address this issue, we propose a spectrogram-augmented SSL method that incorporates high-frequency features (up to 48 kHz sampling rate) through a parallel-branch architecture. We further introduce a two-step training scheme: the model is first pretrained on a large 48 kHz dataset and then fine-tuned on a smaller multi-rate dataset. Experimental…
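The limitation named in the abstract can be made concrete with Nyquist arithmetic. The sketch below is illustrative only: it assumes ideal downsampling to the 16 kHz rate of the SSL model, the helper names `nyquist_hz` and `lost_band_hz` are hypothetical, and the 32 kHz entry is simply an intermediate rate picked from the stated 16-48 kHz range. It is not the paper's implementation.

```python
from typing import Optional, Tuple

# Rate the abstract says SSL models are pretrained on.
SSL_RATE_HZ = 16_000


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2


def lost_band_hz(
    sample_rate_hz: int, model_rate_hz: int = SSL_RATE_HZ
) -> Optional[Tuple[float, float]]:
    """Frequency band (low, high) that disappears when audio is
    downsampled to the model rate, or None if nothing is lost.
    Assumes ideal band-limiting, which is a simplification."""
    low = nyquist_hz(model_rate_hz)
    high = nyquist_hz(sample_rate_hz)
    return (low, high) if high > low else None


if __name__ == "__main__":
    # 32 kHz is an illustrative intermediate rate, not one named in the source.
    for rate in (16_000, 32_000, 48_000):
        print(rate, nyquist_hz(rate), lost_band_hz(rate))
```

Under these assumptions, 48 kHz speech carries content up to 24 kHz, while a 16 kHz model sees only up to 8 kHz, so the 8-24 kHz band is what a parallel high-frequency branch would have to supply.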

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Emotion and Mood Recognition