Multi-Sampling-Frequency Naturalness MOS Prediction Using Self-Supervised Learning Model with Sampling-Frequency-Independent Layer
Go Nishikawa, Wataru Nakata, Yuki Saito, Kanami Imamura, Hiroshi Saruwatari, Tomohiko Nakamura

TL;DR
This paper presents a self-supervised learning (SSL) model with a sampling-frequency-independent (SFI) layer for predicting speech naturalness MOS across multiple sampling frequencies. The submission ranked first in one evaluation metric and fourth overall in AudioMOS Challenge 2025 Track 3.
Contribution
It integrates an SFI convolutional layer into an SSL model for MOS prediction across multiple sampling frequencies, and improves performance with two strategies: knowledge distillation from a pretrained non-SFI SSL model and pretraining on a large-scale MOS dataset.
Findings
Ranked first in one evaluation metric in AMC 2025 Track 3
Placed fourth in the final ranking
Demonstrated effectiveness of SF-independent features
Abstract
We introduce our submission to the AudioMOS Challenge (AMC) 2025 Track 3: mean opinion score (MOS) prediction for speech with multiple sampling frequencies (SFs). Our submitted model integrates an SF-independent (SFI) convolutional layer into a self-supervised learning (SSL) model to achieve SFI speech feature extraction for MOS prediction. We present two strategies to improve the MOS prediction performance of our model: distilling knowledge from a pretrained non-SFI SSL model and pretraining on a large-scale MOS dataset. Our submission to AMC 2025 Track 3 ranked first in one evaluation metric and fourth in the final ranking. We also report the results of an ablation study that investigates the essential factors of our model.
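The abstract does not say how the SFI convolutional layer is built, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes the layer's discrete kernel is obtained by sampling a continuous-time filter at the input sampling frequency, so the filter covers the same duration in seconds and the same frequency band at any SF. The Gaussian-windowed cosine filter, its parameters, and all function names are hypothetical.

```python
import math


def analog_impulse_response(t, center_hz, bandwidth_hz):
    """Hypothetical continuous-time filter: a Gaussian-windowed cosine."""
    window = math.exp(-(math.pi * bandwidth_hz * t) ** 2)
    return window * math.cos(2 * math.pi * center_hz * t)


def sfi_kernel(sf_hz, center_hz=1000.0, bandwidth_hz=400.0, duration_s=0.004):
    """Sample the analog filter at the input sampling frequency (assumed design).

    The kernel spans the same duration in seconds at every SF, so its
    length in samples grows with sf_hz.
    """
    half = int(round(duration_s * sf_hz / 2))
    # Scaling by 1/sf_hz makes the discrete sum approximate the
    # continuous-time convolution integral.
    return [
        analog_impulse_response(n / sf_hz, center_hz, bandwidth_hz) / sf_hz
        for n in range(-half, half + 1)
    ]


def conv_valid(signal, kernel):
    """Plain 1-D convolution without padding."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[k - 1 - j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def peak_response(sf_hz, tone_hz=1000.0, length_s=0.02):
    """Peak output magnitude for a pure tone sampled at sf_hz."""
    n = int(length_s * sf_hz)
    tone = [math.cos(2 * math.pi * tone_hz * i / sf_hz) for i in range(n)]
    return max(abs(v) for v in conv_valid(tone, sfi_kernel(sf_hz)))


if __name__ == "__main__":
    for sf in (16000, 24000, 48000):
        print(sf, len(sfi_kernel(sf)), peak_response(sf))
```

In this sketch the kernel length changes with the sampling frequency while the response to the same 1 kHz tone stays nearly constant, which is the property that lets a single set of filter parameters serve inputs at several SFs. A learned layer would treat the analog filter parameters as trainable weights instead of fixed constants.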
