Leveraging LLMs for Scalable Non-intrusive Speech Quality Assessment
Fredrik Cumlin, Xinyu Liang, Anubhab Ghosh, Saikat Chatterjee

TL;DR
This paper explores using large language models as pseudo-raters to generate labeled data for training speech quality assessment systems, aiming to overcome data scarcity and improve generalization across diverse datasets.
Contribution
It introduces a two-stage training approach, pretraining on LLM-generated labels and then fine-tuning on human labels, to improve the performance and scalability of speech quality assessment.
Findings
Two-stage training improves correlation with human ratings.
LLM-labeled data can supplement limited human annotations.
The approach enhances generalization across datasets and languages.
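As a rough illustration of what "correlation with human ratings" can mean in practice, the sketch below computes the Pearson linear correlation between predicted scores and human mean opinion scores. The choice of Pearson correlation is an assumption here, since the summary does not name the metric, and the score lists are hypothetical values, not data from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

# Hypothetical values for illustration only (not results from the paper).
human_mos = [1.5, 2.0, 3.0, 4.0, 4.5]
predicted_mos = [1.8, 2.1, 2.9, 3.7, 4.4]
print(round(pearson(human_mos, predicted_mos), 3))
```

A value near 1 would indicate that the predictor ranks and scales clips much like the human raters do.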
Abstract
Non-intrusive speech quality assessment (SQA) systems suffer from limited training data and costly human annotations, hindering their generalization to real-time conferencing calls. In this work, we propose leveraging large language models (LLMs) as pseudo-raters for speech quality to address these data bottlenecks. We construct LibriAugmented, a dataset consisting of 101,129 speech clips with simulated degradations labeled by a fine-tuned auditory LLM (Vicuna-7b-v1.5). We compare three training strategies: using human-labeled data, using LLM-labeled data, and a two-stage approach (pretraining on LLM labels, then fine-tuning on human labels), using both DNSMOS Pro and DeePMOS. We test on several datasets across languages and quality degradations. While LLM-labeled training yields mixed results compared to human-labeled training, we provide empirical evidence that the two-stage approach…
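The two-stage strategy described in the abstract (pretrain on LLM labels, then fine-tune on human labels) can be sketched on a toy problem. This is a minimal sketch under loud assumptions: a one-dimensional linear model stands in for DNSMOS Pro or DeePMOS, the scalar "quality feature", the synthetic label rules, and the helper names `fit` and `mse` are all hypothetical, and nothing here reproduces the paper's data or results.

```python
import random

def fit(data, w=0.0, b=0.0, lr=0.1, epochs=500):
    """Batch gradient descent on mean squared error for the model y = w*x + b."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mse(data, w, b):
    """Mean squared error of y = w*x + b on a list of (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical toy data: x is a scalar "quality feature" in [0, 1].
# Assumed ground truth for illustration: MOS = 1 + 4*x.
# Pseudo-labels are plentiful but biased and noisy; human labels are few.
pseudo = []
for _ in range(1000):
    x = rng.random()
    pseudo.append((x, 1.5 + 3.0 * x + rng.gauss(0, 0.3)))
human = []
for _ in range(20):
    x = rng.random()
    human.append((x, 1.0 + 4.0 * x + rng.gauss(0, 0.1)))

# Stage 1: pretrain on the pseudo-labeled set.
w1, b1 = fit(pseudo)
# Stage 2: fine-tune on human labels, starting from the stage-1 weights,
# with a smaller learning rate and fewer steps.
w2, b2 = fit(human, w=w1, b=b1, lr=0.05, epochs=100)

print("human-label MSE after stage 1:", mse(human, w1, b1))
print("human-label MSE after stage 2:", mse(human, w2, b2))
```

The sketch only shows the mechanics of reusing stage-1 weights as the starting point for stage 2. It measures error on the same human-labeled clips used for fine-tuning, so it says nothing about generalization to unseen datasets, which is what the paper evaluates.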
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques
