Audio Large Language Models Can Be Descriptive Speech Quality Evaluators
Chen Chen, Yuchen Hu, Siyin Wang, Helin Wang, Zhehuai Chen, Chao Zhang, Chao-Han Huck Yang, and Eng Siong Chng

TL;DR
This paper introduces a new speech evaluation corpus and an alignment approach with LLM distillation, enabling audio large language models to better assess speech quality and generate descriptive, human-like judgments, surpassing previous models.
Contribution
It presents the first natural language-based speech evaluation corpus and an alignment method that improves speech quality assessment in audio LLMs.
Findings
ALLD outperforms previous regression models in MOS prediction
Achieves 98.6% accuracy in A/B speech sample comparisons
Generated responses surpass task-specific models in BLEU scores
Abstract
An ideal multimodal agent should be aware of the quality of its input modalities. Recent advances have enabled large language models (LLMs) to incorporate auditory systems for handling various speech-related tasks. However, most audio LLMs remain unaware of the quality of the speech they process. This limitation arises because speech quality evaluation is typically excluded from multi-task training due to the lack of suitable datasets. To address this, we introduce the first natural language-based speech evaluation corpus, generated from authentic human ratings. In addition to the overall Mean Opinion Score (MOS), this corpus offers detailed analysis across multiple dimensions and identifies causes of quality degradation. It also enables descriptive comparisons between two speech samples (A/B tests) with human-like judgment. Leveraging this corpus, we propose an alignment approach with…
Peer Reviews
Decision: ICLR 2025 Poster
This paper proposes to predict the MOS within the framework of a currently popular audio LLM scheme. The authors present their method clearly and effectively, detailing how the model assesses various aspects of audio quality before generating an overall MOS score. Experimental results demonstrate the effectiveness of this approach, indicating that the proposed method achieves a higher accuracy in MOS prediction compared to traditional regression-based methods.
- According to the ITU-T definition, the MOS should be an integer between 1 and 5. However, in the example provided in Section 1, the sentence "... Taking into account all factors, the overall MOS score is only 2.4" conflicts with this definition, as MOS should not be a decimal. This discrepancy suggests a fundamental misalignment with the official MOS standard, which could impact the validity of the work.
- Although the authors extend the application of audio LLMs to MOS prediction, this appea
- The paper shows clear outcomes for the novel proposal and experiment
- ALLD achieves the best performance across all systems according to evaluation metrics, and the BLEU score demonstrates the efficacy of this distillation strategy
- Paper describes means of generating evaluation data which is descriptive and can improve Audio LLM performance
Although the experiments show improvements, it is not clear why the LCC and SRCC have not improved on the LIVE and P501 datasets. Also, descriptive language is subjective to users, unlike an evaluation score such as BLEU. How do you propose to keep a consistent descriptive style when generating evaluations?
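For readers unfamiliar with the two metrics named in this review: LCC is the linear (Pearson) correlation coefficient and SRCC is the Spearman rank correlation coefficient between predicted and human MOS values. The sketch below shows one way they could be computed with the Python standard library. The function names and the five MOS values are hypothetical, chosen only for illustration; they are not taken from the paper or its datasets.

```python
from math import sqrt

def pearson(x, y):
    """Linear correlation coefficient (LCC) between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical values for illustration only (not results from the paper).
human_mos = [1.8, 2.4, 3.1, 3.9, 4.6]
predicted_mos = [2.0, 2.2, 3.5, 3.6, 4.4]

lcc = pearson(human_mos, predicted_mos)
srcc = spearman(human_mos, predicted_mos)
print(f"LCC={lcc:.3f}  SRCC={srcc:.3f}")
```

In this toy data the predictions preserve the human ordering exactly, so SRCC reaches 1 while LCC stays below 1. The two metrics can therefore move independently on a given test set.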
**Originality**: This work introduces a descriptive, language-based dataset for speech quality assessment, allowing audio LLMs to conduct more nuanced evaluations. The ALLD framework presents an innovative approach by guiding audio LLMs through token-level distillation. **Quality**: The comparisons between traditional regression methods and descriptive, audio LLM-based approaches offer valuable insights into fine-tuning and demonstrate the effectiveness of natural language guidance. **Clarit
* While a strengths and weaknesses comparison between two systems across specific sub-dimensions is reasonable, it is unclear how a human or LLM might synthesize these into an overall preference judgment. For example, if System 1 outperforms System 2 in one sub-dimension but falls behind in another, the basis for an overarching preference remains ambiguous. The dataset relies on LLM responses to make arbitrary decisions on whether Speech A or Speech B is preferable, using sub-dimensional scores
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Attentive Walk-Aggregating Graph Neural Network
