AttentiveMOS: A Lightweight Attention-Only Model for Speech Quality Prediction

Imran E Kibria; Donald S. Williamson

arXiv:2410.12675 · eess.AS · May 29, 2025

Open Access

TL;DR

AttentiveMOS introduces a lightweight, attention-only neural network leveraging Swin transformer and standard transformer layers to predict speech quality scores, improving generalization and real-world applicability over existing models.

Contribution

The paper presents a novel attention-only architecture for speech quality prediction that avoids large pretrained networks, enhancing efficiency and generalization on limited datasets.

Findings

1. Outperforms baseline models on three datasets.
2. Improves generalization with a self-teaching training strategy.
3. Efficient design suitable for real-world applications.

Abstract

Research in modeling subjective metrics for quality assessment has led to the development of no-reference speech models that directly operate on utterance waveforms to predict the Mean Opinion Score (MOS). These models often rely on convolutional layers for local feature extraction and embeddings from impractically large pretrained networks to enhance generalization. We propose an attention-only model based on Swin transformer and standard transformer layers to extract local context features and global utterance features, respectively. The self-attention operator excels at processing sequences, and our lightweight design enhances generalization on limited MOS datasets while improving real-world applicability. We train our network using a sequential self-teaching strategy to improve generalization on MOS labels affected by noise in listener ratings. Experiments on three datasets confirm…
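
The split between local context features and global utterance features described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head with identity projections, uses non-overlapping windows as a simplified stand-in for Swin-style windowed attention (no shifted windows or patch merging), and replaces the learned regression head with a plain average. The names `self_attention`, `predict_score`, and `window` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames, window=None):
    """Scaled dot-product self-attention with identity projections (assumption).

    frames: list of equal-length feature vectors, one per time frame.
    window: if set, each frame attends only to frames in its own
            non-overlapping block of `window` frames (local context);
            if None, each frame attends to every frame (global context).
    """
    n = len(frames)
    dim = len(frames[0])
    scale = math.sqrt(dim)
    out = []
    for i, query in enumerate(frames):
        if window is None:
            lo, hi = 0, n
        else:
            lo = (i // window) * window
            hi = min(lo + window, n)
        keys = frames[lo:hi]
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible frames.
        out.append([sum(w * key[d] for w, key in zip(weights, keys))
                    for d in range(dim)])
    return out


def predict_score(frames, window=2):
    """Local attention, then global attention, then mean pooling.

    The final average is only a placeholder for a learned MOS regression head.
    """
    local = self_attention(frames, window=window)
    utterance = self_attention(local, window=None)
    dim = len(frames[0])
    pooled = [sum(f[d] for f in utterance) / len(utterance) for d in range(dim)]
    return sum(pooled) / dim


if __name__ == "__main__":
    toy_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    print(predict_score(toy_frames, window=2))
```

With `window=2`, the output for the first frame depends only on the first two frames, whereas the global pass lets every frame influence every other. In this sketch, that difference is what separates the local stage from the utterance-level stage.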

Taxonomy

Topics: Speech and Audio Processing

Methods: Adam · Attention Is All You Need · Dropout · Dense Connections · Layer Normalization · Residual Connection · Stochastic Depth · Position-Wise Feed-Forward Layer · Linear Layer · Byte Pair Encoding