Preference-based training framework for automatic speech quality assessment using deep neural network
Cheng-Hung Hu, Yusuke Yasuda, Tomoki Toda

TL;DR
This paper introduces a preference-based training framework for speech quality assessment using deep neural networks, focusing on ranking synthetic speech systems more effectively than traditional score-based methods.
Contribution
It proposes a novel training framework that leverages preference scores from MOS pairs to improve system ranking accuracy in speech quality assessment.
Findings
The framework outperforms the baseline in Spearman's rank correlation.
Effective pair generation methods and aggregation functions are identified.
The conditions under which the framework performs best are analyzed.
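To make the ranking metric in the findings concrete, here is a minimal sketch of Spearman's rank correlation between ground-truth system scores and predicted system scores. It is not taken from the paper: the function names and the toy numbers are illustrative assumptions, and tied values are given average ranks, which is a common convention but not one the source specifies.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up example: one score per system, in the same system order.
true_system_mos = [3.1, 3.8, 4.2, 2.5]
predicted_scores = [0.4, 0.5, 0.9, 0.1]
rho = spearman(true_system_mos, predicted_scores)
```

Because only the ordering matters, the predicted scores do not need to be on the MOS scale for the correlation to be high.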
Abstract
One objective of Speech Quality Assessment (SQA) is to estimate the ranks of synthetic speech systems. However, recent SQA models are typically trained with low-precision direct scores such as mean opinion scores (MOS) as the training objective, from which rankings are not straightforward to estimate. Although this approach is effective for predicting quality scores of individual sentences, it does not account for speech and system preferences when ranking multiple systems. We propose a training framework for SQA models that can be trained with only preference scores derived from pairs of MOS to improve ranking prediction. Our experiment reveals the conditions under which our framework works best in terms of pair generation, aggregation functions to derive system scores from utterance preferences, and threshold functions to determine preference from a pair of MOS. Our results demonstrate that our…
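The three components named in the abstract (pair generation, a threshold function, and an aggregation function) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The symmetric threshold, the all-cross-system pairing, and the mean-preference aggregation are hypothetical choices, and every function name is invented for this example. In the actual framework a neural network would predict the utterance-level preferences; here the MOS-derived labels stand in for those predictions.

```python
from collections import defaultdict
from itertools import combinations


def preference(mos_a, mos_b, threshold=0.0):
    """Return +1 if a is preferred, -1 if b is preferred, 0 if undecided.

    Assumption: a symmetric dead zone of width `threshold` around zero.
    """
    diff = mos_a - mos_b
    if diff > threshold:
        return 1
    if diff < -threshold:
        return -1
    return 0


def make_pairs(utterances, threshold=0.0):
    """Pair every two utterances from different systems (one possible choice)."""
    pairs = []
    for (sys_a, mos_a), (sys_b, mos_b) in combinations(utterances, 2):
        if sys_a != sys_b:
            pairs.append((sys_a, sys_b, preference(mos_a, mos_b, threshold)))
    return pairs


def aggregate_mean_preference(pairs):
    """System score = mean signed preference over all pairs involving it."""
    outcomes = defaultdict(list)
    for sys_a, sys_b, pref in pairs:
        outcomes[sys_a].append(pref)
        outcomes[sys_b].append(-pref)
    return {system: sum(p) / len(p) for system, p in outcomes.items()}


# Toy data: (system_id, utterance-level MOS). Values are made up.
utterances = [("A", 4.0), ("A", 3.5), ("B", 3.0), ("C", 2.0)]
pairs = make_pairs(utterances, threshold=0.25)
scores = aggregate_mean_preference(pairs)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Raising the threshold turns more near-equal MOS pairs into undecided labels, which is one way a threshold function could change the training signal.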
