Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis
Xintong Hu, Yixuan Chen, Rui Yang, Wenxiang Guo, Changhao Pan

TL;DR
This paper introduces a speech quality assessment model that pairs a self-supervised speech encoder with a Mixture of Experts (MoE) head and synthetic data augmentation. The aim is better system-level and utterance-level MOS prediction, but the gains at the utterance (sentence) level turn out to be limited.
Contribution
It proposes an MoE-based mean opinion score (MOS) prediction system built on self-supervised speech models and synthetic data, and analyzes why sentence-level speech quality assessment remains hard to improve.
Findings
Limited performance improvement at sentence-level prediction
Identified fundamental causes of performance variation across granularities
Provided new insights and pathways for speech quality assessment research
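The granularity gap named in the findings can be made concrete with a small sketch. This summary does not state which metrics the paper reports, so the example assumes Pearson correlation, computed once over individual utterances and once over per-system mean scores. The function names and the toy scores are hypothetical and are not taken from the paper.

```python
import numpy as np
from collections import defaultdict


def pearson(a, b):
    """Pearson correlation between two 1-D score arrays."""
    return float(np.corrcoef(a, b)[0, 1])


def utterance_and_system_corr(system_ids, true_mos, pred_mos):
    """Return (utterance-level, system-level) Pearson correlation.

    Utterance level: correlate every predicted score with its rating.
    System level: average scores per system first, then correlate the means.
    """
    true_mos = np.asarray(true_mos, dtype=float)
    pred_mos = np.asarray(pred_mos, dtype=float)
    utt_corr = pearson(true_mos, pred_mos)

    # Collect the utterance indices that belong to each system.
    groups = defaultdict(list)
    for i, sid in enumerate(system_ids):
        groups[sid].append(i)

    sys_true = np.array([true_mos[idx].mean() for idx in groups.values()])
    sys_pred = np.array([pred_mos[idx].mean() for idx in groups.values()])
    return utt_corr, pearson(sys_true, sys_pred)


# Toy data (made up): predictions get each system's average right
# but rank utterances inside a system in the wrong order.
system_ids = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
true_mos = [1.5, 2.5, 1.5, 2.5, 2.5, 3.5, 2.5, 3.5, 3.5, 4.5, 3.5, 4.5]
pred_mos = [2.2, 1.8, 2.2, 1.8, 3.2, 2.8, 3.2, 2.8, 4.2, 3.8, 4.2, 3.8]

utt_corr, sys_corr = utterance_and_system_corr(system_ids, true_mos, pred_mos)
print(f"utterance-level r = {utt_corr:.3f}, system-level r = {sys_corr:.3f}")
```

In this toy case averaging cancels the within-system errors, so the system-level correlation is higher than the utterance-level one. That is one plausible way a model can look strong per system while remaining weak per utterance.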
Abstract
Automatic speech quality assessment plays a crucial role in the development of speech synthesis systems, but existing models exhibit significant performance variations across different granularity levels of prediction tasks. This paper proposes an enhanced MOS prediction system based on self-supervised learning speech models, incorporating a Mixture of Experts (MoE) classification head and utilizing synthetic data from multiple commercial generation models for data augmentation. Our method builds upon existing self-supervised models such as wav2vec2, designing a specialized MoE architecture to address different types of speech quality assessment tasks. We also collected a large-scale synthetic speech dataset encompassing the latest text-to-speech, speech conversion, and speech enhancement systems. However, despite the adoption of the MoE architecture and expanded dataset, the model's…
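As a rough illustration of the architecture the abstract describes, the sketch below puts a softmax-gated MoE head on top of pooled frame features. It is a minimal sketch under several assumptions: the features stand in for the output of a self-supervised encoder such as wav2vec2, each expert is a single linear layer producing a scalar score, and the gate mixes all experts densely. The paper's actual head (described as a classification head), its expert count, and its routing scheme are not given in this summary, so `MoEHead`, `num_experts`, and the initialization are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class MoEHead:
    """Dense softmax-gated mixture of linear experts (illustrative only)."""

    def __init__(self, feat_dim, num_experts=4, seed=0):
        rng = np.random.default_rng(seed)
        # Gating network: maps a pooled feature vector to expert logits.
        self.gate_w = rng.normal(0.0, 0.02, size=(feat_dim, num_experts))
        # One linear expert per row; bias starts near the middle of a 1-5 MOS scale.
        self.expert_w = rng.normal(0.0, 0.02, size=(num_experts, feat_dim))
        self.expert_b = np.full(num_experts, 3.0)

    def forward(self, frames):
        """frames: (batch, time, feat_dim) encoder outputs (assumed shape)."""
        pooled = frames.mean(axis=1)                   # (batch, feat_dim)
        gate = softmax(pooled @ self.gate_w, axis=-1)  # (batch, num_experts)
        expert_scores = pooled @ self.expert_w.T + self.expert_b
        mos = (gate * expert_scores).sum(axis=-1)      # gate-weighted mix, (batch,)
        return mos, gate


# Random features stand in for real encoder outputs.
frames = np.random.default_rng(1).normal(size=(5, 20, 8))
head = MoEHead(feat_dim=8, num_experts=3)
mos, gate = head.forward(frames)
print(mos.shape, gate.shape)
```

The weights here are untrained, so the scores carry no meaning. The point is only the data flow: pool the frames, compute gate weights, and combine the expert outputs by those weights.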
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis
Methods: Mixture of Experts
