Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis

Xintong Hu; Yixuan Chen; Rui Yang; Wenxiang Guo; Changhao Pan

arXiv:2507.06116 · cs.SD · July 9, 2025

Open Access

TL;DR

This paper introduces a speech quality (MOS) prediction model that pairs a self-supervised speech encoder with a Mixture of Experts (MoE) head and augments training with synthetic speech. The aim is to improve both system-level and utterance-level prediction, but the gains at the utterance (sentence) level turn out to be limited.

Contribution

The paper proposes an MoE-based MOS prediction system built on self-supervised speech models and synthetic data, and analyzes why sentence-level (utterance-level) speech quality assessment remains difficult to improve.

Findings

01 Limited performance improvement at sentence-level prediction

02 Identified fundamental causes of performance variation across granularities

03 Provided new insights and pathways for speech quality assessment research
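To make the two granularities in these findings concrete, the sketch below scores the same predictions at the utterance level (one point per utterance) and at the system level (one point per system, after averaging). This follows a common convention in MOS prediction benchmarks; it is an assumption here, since this listing does not state how the paper computes its metrics. The records, system names, and helper functions are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def pearson(xs, ys):
    """Pearson linear correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def system_level(records):
    """Average true and predicted MOS per system (hypothetical helper)."""
    by_system = defaultdict(list)
    for system_id, true_mos, pred_mos in records:
        by_system[system_id].append((true_mos, pred_mos))
    true_means = [mean(t for t, _ in pairs) for pairs in by_system.values()]
    pred_means = [mean(p for _, p in pairs) for pairs in by_system.values()]
    return true_means, pred_means


# Toy data (invented): (system id, listener MOS, predicted MOS).
records = [
    ("sysA", 4.0, 3.6), ("sysA", 3.0, 3.8), ("sysA", 3.5, 3.4),
    ("sysB", 2.0, 2.6), ("sysB", 3.0, 2.2), ("sysB", 2.5, 2.4),
    ("sysC", 4.5, 4.4), ("sysC", 4.0, 4.6), ("sysC", 5.0, 4.2),
]

# Utterance level: correlate every (true, predicted) pair directly.
utt_corr = pearson([t for _, t, _ in records], [p for _, _, p in records])

# System level: average within each system first, then correlate the means.
sys_true, sys_pred = system_level(records)
sys_corr = pearson(sys_true, sys_pred)

print(f"utterance-level correlation: {utt_corr:.3f}")
print(f"system-level correlation:    {sys_corr:.3f}")
```

In this toy set the predictor ranks the three systems correctly but orders utterances poorly within each system, so averaging hides the per-utterance errors and the system-level correlation comes out higher. That is one plausible reason the two granularities can diverge, not necessarily the cause the paper identifies.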

Abstract

Automatic speech quality assessment plays a crucial role in the development of speech synthesis systems, but existing models exhibit significant performance variations across different granularity levels of prediction tasks. This paper proposes an enhanced MOS prediction system based on self-supervised learning speech models, incorporating a Mixture of Experts (MoE) classification head and utilizing synthetic data from multiple commercial generation models for data augmentation. Our method builds upon existing self-supervised models such as wav2vec2, designing a specialized MoE architecture to address different types of speech quality assessment tasks. We also collected a large-scale synthetic speech dataset encompassing the latest text-to-speech, speech conversion, and speech enhancement systems. However, despite the adoption of the MoE architecture and expanded dataset, the model's…
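As a rough illustration of the architecture the abstract describes, the sketch below puts a small Mixture of Experts head on top of a pooled utterance embedding. Everything specific is an assumption: the embedding stands in for pooled features from a self-supervised encoder such as wav2vec2, each expert is a single linear unit that outputs a scalar score, and routing is dense softmax gating. The listing does not give the paper's actual head design (number of experts, routing scheme, or output format), so `MoEHead` and its parameters are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dense layer: `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class MoEHead:
    """Hypothetical dense-gated MoE head that maps an embedding to one score."""

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)

        def init(rows):
            return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                    for _ in range(rows)]

        # Gating network: one logit per expert.
        self.gate_w, self.gate_b = init(num_experts), [0.0] * num_experts
        # Experts: one linear unit each, biased toward the middle of a 1-5 scale.
        self.expert_w, self.expert_b = init(num_experts), [3.0] * num_experts

    def forward(self, embedding):
        gate = softmax(linear(self.gate_w, self.gate_b, embedding))
        expert_scores = linear(self.expert_w, self.expert_b, embedding)
        # Final prediction: gate-weighted average of the expert outputs.
        score = sum(g * s for g, s in zip(gate, expert_scores))
        return score, gate


# Stand-in for a pooled encoder embedding (real ones are much wider).
embedding = [0.5] * 8
head = MoEHead(dim=8, num_experts=4)
score, gate = head.forward(embedding)
print("predicted score:", round(score, 3))
print("gate weights:", [round(g, 3) for g in gate])
```

The weights here are random and untrained, so the printed score only shows the data flow. In a trained model the gate would learn to send different kinds of input (for example, different synthesis system types) to different experts.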

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis

Methods: Mixture of Experts