Selection of Layers from Self-supervised Learning Models for Predicting Mean-Opinion-Score of Speech

Xinyu Liang; Fredrik Cumlin; Victor Ungureanu; Chandan K. A. Reddy; Christian Schuldt; Saikat Chatterjee

arXiv:2508.08962·eess.AS·August 13, 2025

TL;DR

This study systematically evaluates individual layers of self-supervised speech models as feature sources for predicting mean opinion score (MOS). It finds that early-layer features often match or outperform last-layer features, improving accuracy while reducing model complexity.

Contribution

It demonstrates that selecting early layers from SSL models enhances speech quality prediction, challenging the common reliance on last-layer features.

Findings

01 Early-layer features outperform or match last-layer features in MOS prediction.

02 Using early layers improves performance over conventional methods.

03 The approach reduces system complexity while maintaining high accuracy.

Abstract

Self-supervised learning (SSL) models like Wav2Vec2, HuBERT, and WavLM have been widely used in speech processing. These transformer-based models consist of multiple layers, each capturing a different level of representation. While prior studies have explored their layer-wise representations for efficiency and performance, speech quality assessment (SQA) models predominantly rely on last-layer features, leaving intermediate layers underexamined. In this work, we systematically evaluate different layers of multiple SSL models for predicting mean-opinion-score (MOS). Features from each layer are fed into a lightweight regression network to assess their effectiveness. Our experiments consistently show that early-layer features outperform or match those from the last layer, leading to significant improvements over conventional approaches and state-of-the-art MOS prediction models. These findings highlight…
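
The layer-wise evaluation described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the per-layer features are synthetic stand-ins for SSL hidden states, a closed-form ridge regressor is swapped in for the paper's lightweight regression network, and every name here (`mean_pool`, `fit_ridge`, `rank_layers`, `noise_by_layer`) is hypothetical. The noise levels are arbitrary, so the resulting ranking demonstrates the procedure only and says nothing about real models.

```python
import numpy as np


def mean_pool(frames):
    """Average frame-level features of shape (T, D) over time into one (D,) vector."""
    return frames.mean(axis=0)


def fit_ridge(X, y, lam=1e-2):
    """Closed-form ridge regression with a bias column (stand-in for a small regression net)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)


def predict(w, X):
    """Apply the fitted weights (including bias) to pooled features."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w


def pearson(a, b):
    """Pearson correlation between predicted and reference scores."""
    return float(np.corrcoef(a, b)[0, 1])


def rank_layers(layer_feats, mos, n_train):
    """Fit one regressor per layer and score it on the held-out utterances.

    layer_feats maps a layer index to an (N, D) array of pooled features.
    """
    scores = {}
    for layer, X in layer_feats.items():
        w = fit_ridge(X[:n_train], mos[:n_train])
        scores[layer] = pearson(predict(w, X[n_train:]), mos[n_train:])
    return scores


# --- Synthetic demo (assumption: no real SSL model or MOS dataset is used) ---
rng = np.random.default_rng(0)
N, T, D = 200, 20, 8                      # utterances, frames, feature dimension
mos = rng.uniform(1.0, 5.0, size=N)       # fake MOS labels on a 1-to-5 scale
direction = rng.normal(size=D)            # fake "quality" direction in feature space

# Hypothetical noise level per layer index, chosen only to make the layers differ.
noise_by_layer = {1: 0.5, 6: 4.0, 12: 16.0}

layer_feats = {}
for layer, noise in noise_by_layer.items():
    frames = mos[:, None, None] * direction[None, None, :]
    frames = frames + noise * rng.normal(size=(N, T, D))
    layer_feats[layer] = np.stack([mean_pool(f) for f in frames])

scores = rank_layers(layer_feats, mos, n_train=150)
best_layer = max(scores, key=scores.get)
print(scores, best_layer)
```

With a real model, `layer_feats` would instead be built from the hidden states of each transformer layer for every utterance, and the held-out correlation per layer would be compared against the last-layer baseline.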

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition