Audio-Aware Large Language Models as Judges for Speaking Styles

Cheng-Han Chiang; Xiaofei Wang; Chung-Ching Lin; Kevin Lin; Linjie Li; Radu Kopetz; Yao Qian; Zhendong Wang; Zhengyuan Yang; Hung-yi Lee; Lijuan Wang

arXiv:2506.05984·eess.AS·June 9, 2025

TL;DR

This paper shows that audio-aware large language models (ALLMs) can judge the speaking styles of generated speech: the agreement between Gemini-2.5-pro and human judges is comparable to the agreement between human evaluators.

Contribution

The paper introduces ALLMs as automatic judges of speaking style, compares their assessments with human judgments on two tasks (voice style instruction following and role-playing), and highlights their potential for speech evaluation.

Findings

01 Gemini-2.5-pro's agreement with human judges is comparable to the agreement between human evaluators.

02 ALLM judgments reveal current SLMs' limitations in style control.

03 ALLMs show promise as reliable judges of speaking style.

Abstract

Audio-aware large language models (ALLMs) can understand the textual and non-textual information in the audio input. In this paper, we explore using ALLMs as automatic judges to assess the speaking styles of speech. We use ALLM judges to evaluate speech generated by spoken language models (SLMs) on two tasks: voice style instruction following and role-playing. The speaking styles we consider include emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. We use four SLMs to complete the two tasks and use humans and ALLMs to judge the SLMs' responses. We compare two ALLM judges, GPT-4o-audio and Gemini-2.5-pro, with human evaluation results and show that the agreement between Gemini and human judges is comparable to the agreement between human evaluators. These promising results show that ALLMs can be used as judges to evaluate SLMs. Our results…
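The comparison described in the abstract, judge-human agreement set against human-human agreement, can be illustrated with a minimal sketch. The agreement metric is not stated in the text above, so Pearson correlation is used here purely as an assumption; the function names and the 1-5 ratings are hypothetical and are not taken from the paper or its dataset.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def human_human_agreement(human_scores):
    """Mean pairwise correlation across all pairs of human raters."""
    pairs = combinations(human_scores, 2)
    return mean(pearson(a, b) for a, b in pairs)


def judge_human_agreement(judge_scores, human_scores):
    """Mean correlation between the model judge and each human rater."""
    return mean(pearson(judge_scores, h) for h in human_scores)


# Hypothetical 1-5 ratings for six SLM responses (illustrative only,
# not data from the paper).
humans = [
    [5, 3, 1, 4, 2, 4],
    [4, 3, 2, 5, 1, 4],
    [5, 2, 1, 4, 2, 3],
]
allm_judge = [5, 3, 1, 4, 1, 4]

hh = human_human_agreement(humans)
jh = judge_human_agreement(allm_judge, humans)
print(f"human-human: {hh:.2f}, judge-human: {jh:.2f}")
```

Under this reading, a judge whose judge-human value is close to the human-human value would count as "comparable" to a human evaluator; the paper itself may define agreement differently.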

Code & Models

Datasets

dcml0714/StyleSet
dataset · 119 downloads

Videos

Audio-Aware Large Language Models as Judges for Speaking Styles

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Music and Audio Processing