Multi-Sampling-Frequency Naturalness MOS Prediction Using Self-Supervised Learning Model with Sampling-Frequency-Independent Layer

Go Nishikawa; Wataru Nakata; Yuki Saito; Kanami Imamura; Hiroshi Saruwatari; Tomohiko Nakamura

arXiv:2507.14647·cs.SD·August 20, 2025

TL;DR

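This paper presents a self-supervised learning (SSL) model with a sampling-frequency-independent (SFI) convolutional layer for predicting speech naturalness MOS across multiple sampling frequencies. The submission ranked first in one evaluation metric and fourth in the final ranking of AudioMOS Challenge 2025 Track 3.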

Contribution

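It integrates a sampling-frequency-independent (SFI) convolutional layer into a self-supervised learning (SSL) model for MOS prediction across multiple sampling frequencies, and improves performance with two strategies: knowledge distillation from a pretrained non-SFI SSL model and pretraining on a large-scale MOS dataset.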

Findings

01 Ranked first in one evaluation metric in AMC 2025 Track 3

02 Placed fourth in the final ranking

03 Demonstrated effectiveness of SF-independent features

Abstract

We introduce our submission to the AudioMOS Challenge (AMC) 2025 Track 3: mean opinion score (MOS) prediction for speech with multiple sampling frequencies (SFs). Our submitted model integrates an SF-independent (SFI) convolutional layer into a self-supervised learning (SSL) model to achieve SFI speech feature extraction for MOS prediction. We present two strategies to improve the MOS prediction performance of our model: distilling knowledge from a pretrained non-SFI SSL model and pretraining with a large-scale MOS dataset. Our submission to AMC 2025 Track 3 ranked first in one evaluation metric and fourth in the final ranking. We also report the results of our ablation study to investigate essential factors of our model.
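The abstract does not say how the SFI convolutional layer produces its weights, so the following is only a minimal sketch of one way such a layer could work, not the authors' implementation. It assumes the kernel is obtained by sampling a continuous-time (analog) filter at the input's sampling frequency, so the filter covers the same span in seconds at every rate. The filter shape, the parameter values, and all function names (`analog_filter`, `sfi_kernel`, `correlate_valid`, `peak_response`) are hypothetical and chosen for illustration.

```python
import math


def analog_filter(t, center_hz=1000.0, width_s=0.001):
    # Hypothetical continuous-time impulse response: a Gaussian-windowed cosine.
    # In a trainable layer, center_hz and width_s would be learned parameters.
    return math.exp(-0.5 * (t / width_s) ** 2) * math.cos(2.0 * math.pi * center_hz * t)


def sfi_kernel(fs_hz, duration_s=0.005):
    # Assumption: build the discrete kernel by sampling the analog filter
    # on the grid t = n / fs. The kernel spans the same duration in seconds
    # at every rate, so its length in samples grows with fs.
    half = int(round(duration_s * fs_hz / 2.0))
    # Dividing by fs makes the discrete sum approximate the continuous
    # integral, keeping the gain comparable across sampling frequencies.
    return [analog_filter(n / fs_hz) / fs_hz for n in range(-half, half + 1)]


def correlate_valid(signal, kernel):
    # Plain "valid" cross-correlation, as a 1-D convolutional layer computes it.
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def peak_response(fs_hz, tone_hz=1000.0, tone_s=0.02):
    # Filter a pure tone sampled at fs_hz and return the peak output magnitude.
    n_samples = int(round(tone_s * fs_hz))
    tone = [math.sin(2.0 * math.pi * tone_hz * n / fs_hz) for n in range(n_samples)]
    return max(abs(y) for y in correlate_valid(tone, sfi_kernel(fs_hz)))


if __name__ == "__main__":
    for fs in (16000, 48000):
        print(fs, len(sfi_kernel(fs)), peak_response(fs))
```

With these illustrative settings the kernel has 81 taps at 16 kHz and 241 taps at 48 kHz, while the peak response to the same 1 kHz tone stays nearly the same at both rates. That consistency across rates is the property a sampling-frequency-independent feature extractor aims for.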
