A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models

Ryandhimas E. Zezario; Sabato M. Siniscalchi; Hsin-Min Wang; Yu Tsao

arXiv:2409.09914·eess.AS·January 22, 2025

TL;DR

This paper explores zero-shot non-intrusive speech assessment with large language models. It compares GPT-4o with GPT-Whisper and finds that GPT-Whisper correlates better with human quality and intelligibility ratings and with the character error rate (CER) of automatic speech recognition.

Contribution

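It introduces GPT-Whisper, a zero-shot speech assessment method that uses Whisper as an audio-to-text module and then rates the naturalness of the transcribed text through targeted prompt engineering. It requires no additional training and achieves higher prediction accuracy than GPT-4o used alone.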
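
The two-stage idea described above (transcribe, then prompt for a naturalness rating) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the 1 to 5 scale, and the names `assess_speech`, `transcribe`, and `ask_llm` are all hypothetical, and the two callables stand in for whatever Whisper wrapper and LLM client one actually uses.

```python
import re
from typing import Callable

# Hypothetical prompt wording and 1-5 scale; the paper's actual prompt
# is not reproduced on this page.
PROMPT_TEMPLATE = (
    "Rate the naturalness of the following transcribed sentence on a scale "
    "from 1 (unnatural) to 5 (natural). Reply with the number only.\n"
    "Sentence: {text}"
)


def assess_speech(
    audio_path: str,
    transcribe: Callable[[str], str],
    ask_llm: Callable[[str], str],
) -> float:
    """Zero-shot score: audio -> transcript -> LLM naturalness rating.

    `transcribe` maps an audio path to text (e.g. a Whisper wrapper) and
    `ask_llm` maps a prompt to the model's text reply. Both are supplied
    by the caller, so no particular API is assumed here.
    """
    transcript = transcribe(audio_path)
    reply = ask_llm(PROMPT_TEMPLATE.format(text=transcript))

    # Take the first number in the reply; fail loudly if there is none.
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no numeric score in LLM reply: {reply!r}")

    # Clamp to the assumed 1-5 range.
    return min(5.0, max(1.0, float(match.group())))
```

Passing the two stages in as callables keeps the sketch independent of any specific model or client library; stubs are enough to exercise it.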

Findings

01. GPT-Whisper outperforms GPT-4o in speech assessment accuracy.

02. GPT-Whisper has higher correlation with human judgments of speech quality and intelligibility.

03. GPT-Whisper surpasses supervised models in CER correlation.

Abstract

This work investigates two strategies for zero-shot non-intrusive speech assessment leveraging large language models. First, we explore the audio analysis capabilities of GPT-4o. Second, we propose GPT-Whisper, which uses Whisper as an audio-to-text module and evaluates the naturalness of text via targeted prompt engineering. We evaluate the assessment metrics predicted by GPT-4o and GPT-Whisper, examining their correlation with human-based quality and intelligibility assessments and the character error rate (CER) of automatic speech recognition. Experimental results show that GPT-4o alone is less effective for audio analysis, while GPT-Whisper achieves higher prediction accuracy, has moderate correlation with speech quality and intelligibility, and has higher correlation with CER. Compared to SpeechLMScore and DNSMOS, GPT-Whisper excels in intelligibility metrics, but performs slightly…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders