RAMP: Retrieval-Augmented MOS Prediction via Confidence-based Dynamic Weighting
Hui Wang, Shiwan Zhao, Xiguang Zheng, Yong Qin

TL;DR
The paper introduces RAMP, a retrieval-augmented method for MOS prediction that dynamically adjusts retrieval scope and fusion weights based on confidence, improving performance in synthetic speech quality evaluation.
Contribution
RAMP enhances MOS prediction by integrating retrieval-augmented features with a confidence-based dynamic weighting mechanism, addressing data scarcity for the decoder.
Findings
Outperforms existing methods in multiple scenarios
Improves decoder performance under data scarcity
Demonstrates effectiveness of confidence-based dynamic weighting
Abstract
Automatic Mean Opinion Score (MOS) prediction is crucial for evaluating the perceptual quality of synthetic speech. While recent approaches using pre-trained self-supervised learning (SSL) models have shown promising results, they only partly address the data scarcity issue for the feature extractor. This leaves the data scarcity issue for the decoder unresolved and leads to suboptimal performance. To address this challenge, we propose a retrieval-augmented MOS prediction method, dubbed RAMP, to strengthen the decoder against data scarcity. A fusing network is also proposed to dynamically adjust the retrieval scope for each instance and the fusion weights based on the predictive confidence. Experimental results show that our proposed method outperforms the existing methods in multiple scenarios.
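The idea of adjusting retrieval scope and fusion weights from predictive confidence can be illustrated with a minimal sketch. This is not the paper's fusing network, which is learned. It is a hand-written rule under stated assumptions: confidence is a scalar in [0, 1], lower confidence widens the retrieval scope `k`, and the fused score is a convex combination of the decoder prediction and a distance-weighted k-nearest-neighbor estimate. The names `knn_mos`, `fuse`, `k_min`, and `k_max` are hypothetical and do not come from the abstract.

```python
import math


def knn_mos(query, datastore, k):
    """Distance-weighted average MOS of the k nearest datastore entries.

    datastore is a list of (embedding, mos) pairs; embeddings are tuples of floats.
    """
    nearest = sorted(datastore, key=lambda item: math.dist(query, item[0]))[:k]
    # Inverse-distance weights; the small constant avoids division by zero.
    weights = [1.0 / (1e-8 + math.dist(query, emb)) for emb, _ in nearest]
    return sum(w * mos for w, (_, mos) in zip(weights, nearest)) / sum(weights)


def fuse(query, datastore, decoder_mos, confidence, k_min=1, k_max=8):
    """Fuse a decoder prediction with a retrieved estimate.

    Assumption (not from the paper): retrieval scope and fusion weight are
    simple linear functions of the decoder's confidence.
    """
    confidence = min(max(confidence, 0.0), 1.0)
    # Lower confidence -> consult more neighbors.
    k = k_min + round((1.0 - confidence) * (k_max - k_min))
    k = min(k, len(datastore))
    retrieved = knn_mos(query, datastore, k)
    # Higher confidence -> trust the decoder more.
    fused = confidence * decoder_mos + (1.0 - confidence) * retrieved
    return fused, k


# Toy datastore of (embedding, MOS) pairs, invented for illustration only.
toy_store = [((0.0, 0.0), 2.0), ((1.0, 0.0), 4.0), ((5.0, 5.0), 1.0)]
score, scope = fuse((0.0, 0.0), toy_store, decoder_mos=3.0, confidence=0.5)
```

In this sketch a fully confident decoder ignores the retrieved neighbors, while an unconfident one falls back almost entirely on them. A learned fusing network would replace both linear rules with trained functions.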
