SpeechQualityLLM: LLM-Based Multimodal Assessment of Speech Quality
Mahathir Monjur, Shahriar Nirjon

TL;DR
SpeechQualityLLM is a multimodal system that couples an audio encoder with a language model to assess speech quality. It answers natural-language queries, explains its judgments in text, and reduces the need for costly listening tests.
Contribution
It introduces an LLM-based multimodal approach to speech quality assessment that supports interactive queries and textual rationales, capabilities that classical metrics such as PESQ and POLQA and MOS-regression models such as NISQA do not natively provide.
Findings
Achieves a MOS mean absolute error (MAE) of 0.41 and a correlation of 0.86 on NISQA data.
Supports natural-language questions about speech degradations.
Provides diverse, human-like quality judgments.
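The two headline numbers above are standard agreement measures between predicted and subjective MOS. The following is a minimal sketch of how such figures are typically computed, assuming the reported correlation is a Pearson coefficient (the summary does not say which kind). The function names and sample scores are hypothetical and are not taken from the paper.

```python
import math


def mean_absolute_error(predicted, reference):
    """Average absolute difference between predicted and reference MOS."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def pearson_correlation(x, y):
    """Pearson linear correlation coefficient between two score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


if __name__ == "__main__":
    # Made-up scores for illustration only; not results from the paper.
    predicted_mos = [3.1, 4.2, 2.0, 1.5]
    reference_mos = [3.0, 4.5, 2.4, 1.2]
    print("MAE:", mean_absolute_error(predicted_mos, reference_mos))
    print("Pearson r:", pearson_correlation(predicted_mos, reference_mos))
```

MAE is expressed in MOS points, so it reads directly as the typical size of an error on the rating scale, while the correlation captures whether the model ranks clips in the same order as listeners do.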
Abstract
Objective speech quality assessment is central to telephony, VoIP, and streaming systems, where large volumes of degraded audio must be monitored and optimized at scale. Classical metrics such as PESQ and POLQA approximate human mean opinion scores (MOS) but require carefully controlled conditions and expensive listening tests, while learning-based models such as NISQA regress MOS and multiple perceptual dimensions from waveforms or spectrograms, achieving high correlation with subjective ratings yet remaining rigid: they do not support interactive, natural-language queries and do not natively provide textual rationales. In this work, we introduce SpeechQualityLLM, a multimodal speech quality question-answering (QA) system that couples an audio encoder with a language model and is trained on the NISQA corpus using template-based question-answer pairs covering overall MOS and four…
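The abstract mentions training on template-based question-answer pairs. As a rough illustration of that idea, the sketch below turns a numeric MOS label into a textual QA pair by filling templates. The template wording, the `make_mos_qa_pair` helper, and the output format are all hypothetical; the paper's actual templates are not shown in this summary, and only the overall MOS case is covered here.

```python
import random

# Hypothetical question phrasings; the paper's real templates may differ.
QUESTION_TEMPLATES = [
    "What is the overall quality of this speech sample?",
    "How would you rate this audio on a mean opinion score scale?",
    "Estimate the MOS of this recording.",
]

# Hypothetical answer phrasings with a slot for the labeled score.
ANSWER_TEMPLATES = [
    "The overall MOS is about {mos:.1f}.",
    "I would rate this sample {mos:.1f} on the MOS scale.",
]


def make_mos_qa_pair(mos, rng):
    """Build one question-answer pair from a numeric MOS label."""
    question = rng.choice(QUESTION_TEMPLATES)
    answer = rng.choice(ANSWER_TEMPLATES).format(mos=mos)
    return {"question": question, "answer": answer}


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sampled pairs are reproducible
    for label in (4.3, 2.1):  # made-up labels for illustration
        print(make_mos_qa_pair(label, rng))
```

Sampling from several phrasings per label is one simple way to expose a language model to varied wording for the same underlying score.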
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Image and Video Quality Assessment
