Unifying Listener Scoring Scales: Comparison Learning Framework for Speech Quality Assessment and Continuous Speech Emotion Recognition
Cheng-Hung Hu, Yusuke Yasuda, Akifumi Yoshimoto, Tomoki Toda

TL;DR
This paper proposes a framework that models a unified listener scoring scale and uses comparison scores to address listener bias, improving prediction accuracy in speech quality assessment and continuous speech emotion recognition.
Contribution
It introduces a novel comparison learning framework that models a unified listener scoring scale, overcoming biases from individual listener ratings in speech assessment tasks.
Findings
Improved prediction accuracy in both Speech Quality Assessment (SQA) and Continuous Speech Emotion Recognition (CSER) tasks.
Effective modeling of listener scoring relationships.
Robustness across different speech assessment scenarios.
Abstract
Speech Quality Assessment (SQA) and Continuous Speech Emotion Recognition (CSER) are two key tasks in speech technology, both relying on listener ratings. However, these ratings are inherently biased due to individual listener factors. Previous approaches have introduced a mean listener scoring scale and modeled all listener scoring scales in the training set. However, the mean listener approach is prone to distortion from averaging ordinal data, leading to potential biases. Moreover, learning multiple listener scoring scales while inferring based only on the mean listener scale limits effectiveness. In contrast, our method focuses on modeling a unified listener scoring scale, using comparison scores to correctly capture the scoring relationships between utterances. Experimental results show that our method effectively improves prediction performance in both SQA and CSER tasks, proving…
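The abstract's contrast between averaging ratings and comparing utterances can be illustrated with a toy example. The sketch below is not the authors' implementation: the abstract does not say how comparison scores are computed, so it assumes (hypothetically) that a comparison score is the difference between two ratings given by the same listener. The listener names, utterance IDs, and ratings are made up for illustration.

```python
from itertools import combinations
from statistics import mean

# Hypothetical toy ratings: (listener_id, utterance_id, score on a 1-5 scale).
# "strict" tends to rate low and "lenient" tends to rate high.
ratings = [
    ("strict", "utt_a", 3),
    ("strict", "utt_b", 2),
    ("lenient", "utt_b", 5),
    ("lenient", "utt_c", 4),
]

def mean_scores(ratings):
    """Average all ratings per utterance (the mean-listener view)."""
    by_utt = {}
    for _, utt, score in ratings:
        by_utt.setdefault(utt, []).append(score)
    return {utt: mean(scores) for utt, scores in by_utt.items()}

def comparison_pairs(ratings):
    """Build (utt_x, utt_y, score_x - score_y) from ratings by the same listener.

    Assumption: a comparison score is a within-listener score difference,
    so a constant per-listener offset cancels out.
    """
    by_listener = {}
    for listener, utt, score in ratings:
        by_listener.setdefault(listener, []).append((utt, score))
    pairs = []
    for items in by_listener.values():
        for (utt_x, s_x), (utt_y, s_y) in combinations(items, 2):
            pairs.append((utt_x, utt_y, s_x - s_y))
    return pairs

print(mean_scores(ratings))
print(comparison_pairs(ratings))
```

In this toy data the per-utterance means order the utterances as utt_c > utt_b > utt_a, while every within-listener comparison points the other way (utt_a above utt_b, utt_b above utt_c). This is the kind of distortion from averaging that the abstract attributes to listener-specific scoring scales.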
Taxonomy
Topics: Emotion and Mood Recognition · Speech and Audio Processing · Speech Recognition and Synthesis
