AttentiveMOS: A Lightweight Attention-Only Model for Speech Quality Prediction
Imran E Kibria, Donald S. Williamson

TL;DR
AttentiveMOS introduces a lightweight, attention-only neural network leveraging Swin transformer and standard transformer layers to predict speech quality scores, improving generalization and real-world applicability over existing models.
Contribution
The paper presents a novel attention-only architecture for speech quality prediction that avoids large pretrained networks, enhancing efficiency and generalization on limited datasets.
Findings
Outperforms baseline models on three datasets.
Improves generalization with a self-teaching training strategy.
Efficient design suitable for real-world applications.
Abstract
Research in modeling subjective metrics for quality assessment has led to the development of no-reference speech models that directly operate on utterance waveforms to predict the Mean Opinion Score (MOS). These models often rely on convolutional layers for local feature extraction and embeddings from impractically large pretrained networks to enhance generalization. We propose an attention-only model based on Swin transformer and standard transformer layers to extract local context features and global utterance features, respectively. The self-attention operator excels at processing sequences, and our lightweight design enhances generalization on limited MOS datasets while improving real-world applicability. We train our network using a sequential self-teaching strategy to improve generalization on MOS labels affected by noise in listener ratings. Experiments on three datasets confirm…
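The local-then-global split described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes identity query/key/value projections, a single head, non-overlapping windows without the shifting step that Swin layers use, and mean pooling to a scalar. The names `self_attention`, `windowed_attention`, and `predict_score` are hypothetical, and the output is an unscaled number, not a calibrated MOS.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention over a list of feature vectors.

    Assumption: queries, keys, and values are the frames themselves
    (identity projections), so there are no learned weights here.
    """
    dim = len(frames[0])
    outputs = []
    for query in frames:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, frames))
             for d in range(dim)]
        )
    return outputs


def windowed_attention(frames, window):
    """Local attention: frames attend only within their own window.

    Assumption: windows are non-overlapping and unshifted, which is a
    simplification of Swin-style windowed attention.
    """
    outputs = []
    for start in range(0, len(frames), window):
        outputs.extend(self_attention(frames[start:start + window]))
    return outputs


def predict_score(frames, window=2):
    """Local windowed attention, then global attention, then mean pooling."""
    local = windowed_attention(frames, window)   # local context features
    utterance = self_attention(local)            # global utterance features
    per_frame = [sum(f) / len(f) for f in utterance]
    return sum(per_frame) / len(per_frame)
```

In this sketch, changing frames in one window leaves the windowed output of every other window untouched, whereas the global layer mixes information across the whole utterance. That is the division of labor the abstract attributes to the Swin and standard transformer layers, respectively.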
Taxonomy
Topics: Speech and Audio Processing
Methods: Adam · Attention Is All You Need · Dropout · Dense Connections · Layer Normalization · Residual Connection · Stochastic Depth · Position-Wise Feed-Forward Layer · Linear Layer · Byte Pair Encoding
