Investigating the Reasonable Effectiveness of Speaker Pre-Trained Models and their Synergistic Power for SingMOS Prediction
Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma

TL;DR
This paper evaluates speaker recognition pre-trained models for SingMOS prediction and shows that fusing them through a novel framework, BATCH, achieves state-of-the-art results in assessing singing voice quality.
Contribution
It introduces a new fusion framework, BATCH, that combines speaker recognition PTMs with music PTMs, significantly improving SingMOS prediction performance.
Findings
Speaker recognition speech PTMs (SPTMs) outperform other PTMs in SingMOS prediction
BATCH fusion sets new state-of-the-art results
Fusion of speaker recognition and music PTMs enhances prediction accuracy
Abstract
In this study, we focus on Singing Voice Mean Opinion Score (SingMOS) prediction. Previous research has shown the performance benefit of using state-of-the-art (SOTA) pre-trained models (PTMs). However, it has not explored speaker recognition speech PTMs (SPTMs) such as x-vector and ECAPA, and we hypothesize that they will be the most effective for SingMOS prediction. We believe that their speaker recognition pre-training equips them to capture fine-grained vocal features (e.g., pitch, tone, intensity) from synthesized singing voices much better than other PTMs. Our experiments with SOTA PTMs, including SPTMs and music PTMs, validate the hypothesis. Additionally, we introduce a novel fusion framework, BATCH, that uses Bhattacharyya distance for fusion of PTMs. Through BATCH with the fusion of speaker recognition SPTMs, we report the topmost performance…
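The abstract names Bhattacharyya distance as the basis of BATCH but does not say how it enters the fusion. As a minimal sketch of the quantity itself, and not of the authors' framework, the snippet below computes the standard Bhattacharyya distance between two discrete distributions, D_B(p, q) = -ln Σ_i sqrt(p_i · q_i). The function names, the softmax normalization of embeddings, and the toy vectors are all assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Turn a real-valued vector into a probability distribution."""
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()


def bhattacharyya_distance(p, q, eps=1e-12):
    """Bhattacharyya distance between two discrete distributions p and q.

    Both inputs must be non-negative, sum to 1, and have the same length.
    """
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    # Bhattacharyya coefficient: overlap between the two distributions.
    bc = np.sum(np.sqrt(p * q))
    # Clip so the log is defined and rounding cannot yield a negative distance.
    return float(-np.log(np.clip(bc, eps, 1.0)))


# Hypothetical stand-ins for two PTM embeddings projected to a common size.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=8)
emb_b = rng.normal(size=8)

# Assumption: embeddings are softmax-normalized before comparison.
p, q = softmax(emb_a), softmax(emb_b)
print(bhattacharyya_distance(p, q))
```

The distance is zero when the two distributions are identical and grows as their overlap shrinks, which is the property that makes it usable as a measure of disagreement between two representations.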
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
