Investigating the Reasonable Effectiveness of Speaker Pre-Trained Models and their Synergistic Power for SingMOS Prediction

Orchid Chetia Phukan; Girish; Mohd Mujtaba Akhtar; Swarup Ranjan Behera; Pailla Balakrishna Reddy; Arun Balaji Buduru; Rajesh Sharma

arXiv:2506.02232·eess.AS·June 4, 2025

Investigating the Reasonable Effectiveness of Speaker Pre-Trained Models and their Synergistic Power for SingMOS Prediction

Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma

PDF

Open Access

TL;DR

This paper evaluates the effectiveness of speaker recognition pre-trained models for SingMOS prediction, demonstrating that their fusion with a novel BATCH framework achieves state-of-the-art results in assessing singing voice quality.

Contribution

It introduces a new fusion framework, BATCH, that combines speaker recognition PTMs with music PTMs, significantly improving SingMOS prediction performance.

Findings

01

SPTMs outperform other PTMs in SingMOS prediction

02

BATCH fusion achieves top performance and sets new state-of-the-art results

03

Fusion of speaker recognition and music PTMs enhances prediction accuracy

Abstract

In this study, we focus on Singing Voice Mean Opinion Score (SingMOS) prediction. Previous research have shown the performance benefit with the use of state-of-the-art (SOTA) pre-trained models (PTMs). However, they haven't explored speaker recognition speech PTMs (SPTMs) such as x-vector, ECAPA and we hypothesize that it will be the most effective for SingMOS prediction. We believe that due to their speaker recognition pre-training, it equips them to capture fine-grained vocal features (e.g., pitch, tone, intensity) from synthesized singing voices in a much more better way than other PTMs. Our experiments with SOTA PTMs including SPTMs and music PTMs validates the hypothesis. Additionally, we introduce a novel fusion framework, BATCH that uses Bhattacharya Distance for fusion of PTMs. Through BATCH with the fusion of speaker recognition SPTMs, we report the topmost performance…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

MethodsFocus