A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models
Ryandhimas E. Zezario, Sabato M. Siniscalchi, Hsin-Min Wang, Yu Tsao

TL;DR
This paper explores zero-shot non-intrusive speech assessment using large language models. It compares GPT-4o with GPT-Whisper and finds that GPT-Whisper correlates better with human assessments and with the character error rate (CER) of automatic speech recognition.
Contribution
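The paper introduces GPT-Whisper, a zero-shot speech assessment method that uses Whisper as an audio-to-text module and scores the naturalness of the resulting text through prompt engineering. It needs no additional training and predicts assessment scores more accurately than GPT-4o alone.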
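The two-stage idea can be illustrated with a minimal sketch, assuming the transcription step and the language-model call are supplied as plain callables. The names `build_prompt`, `parse_score`, `assess`, `transcribe`, and `ask_llm`, the prompt wording, and the 1-to-5 scale are all hypothetical and are not taken from the paper.

```python
import re
from typing import Callable


def build_prompt(transcript: str) -> str:
    # Hypothetical prompt wording; the paper's actual prompt is not reproduced here.
    return (
        "Rate the naturalness of the following transcribed speech "
        "on a scale from 1 (unnatural) to 5 (natural). "
        "Reply with a single number.\n\n"
        f"Transcript: {transcript}"
    )


def parse_score(reply: str, low: float = 1.0, high: float = 5.0) -> float:
    # Take the first number in the reply and clamp it to the assumed scale.
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no numeric score in reply: {reply!r}")
    return min(max(float(match.group()), low), high)


def assess(
    audio_path: str,
    transcribe: Callable[[str], str],  # assumed: audio path -> transcript (e.g. a Whisper wrapper)
    ask_llm: Callable[[str], str],     # assumed: prompt -> model reply text
) -> float:
    # Stage 1: audio to text. Stage 2: the text is scored by a prompted language model.
    transcript = transcribe(audio_path)
    return parse_score(ask_llm(build_prompt(transcript)))
```

Passing the two stages in as callables keeps the sketch independent of any particular speech-recognition or language-model client, so either stage can be swapped or stubbed for testing.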
Findings
GPT-Whisper outperforms GPT-4o in speech assessment accuracy.
GPT-Whisper has higher correlation with human judgments of speech quality and intelligibility.
GPT-Whisper surpasses supervised models in CER correlation.
Abstract
This work investigates two strategies for zero-shot non-intrusive speech assessment leveraging large language models. First, we explore the audio analysis capabilities of GPT-4o. Second, we propose GPT-Whisper, which uses Whisper as an audio-to-text module and evaluates the naturalness of text via targeted prompt engineering. We evaluate the assessment metrics predicted by GPT-4o and GPT-Whisper, examining their correlation with human-based quality and intelligibility assessments and the character error rate (CER) of automatic speech recognition. Experimental results show that GPT-4o alone is less effective for audio analysis, while GPT-Whisper achieves higher prediction accuracy, has moderate correlation with speech quality and intelligibility, and has higher correlation with CER. Compared to SpeechLMScore and DNSMOS, GPT-Whisper excels in intelligibility metrics, but performs slightly…
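The evaluation described in the abstract rests on two quantities: the character error rate of a recognizer's output and the correlation between predicted scores and a reference measure. The sketch below shows one common way to compute each, assuming CER is the character-level edit distance divided by the reference length and assuming a Pearson linear correlation; the abstract does not state which correlation coefficient is used, and the function names are illustrative.

```python
import math
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    # Levenshtein distance over characters, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    # Character error rate: edit distance normalized by reference length.
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    # Pearson linear correlation between two equal-length score lists.
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)
```

In this setup, `pearson` would be called once per comparison, for example with predicted scores against human ratings, or predicted scores against per-utterance `cer` values.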
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders
