AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation

Potsawee Manakul; Woody Haosheng Gan; Michael J. Ryan; Ali Sartaz Khan; Warit Sirichotedumrong; Kunat Pipatanakul; William Held; Diyi Yang

arXiv:2507.12705 · cs.CL · July 18, 2025

TL;DR

This paper investigates Large Audio Models as a unified judge for speech evaluation, covering both audio characteristic detection and system-level human preference simulation, and reports up to 0.91 Spearman correlation with human preferences.

Contribution

It introduces AudioJudge, a systematic framework leveraging large audio models with prompt engineering and multi-aspect ensemble methods for comprehensive speech evaluation.
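As an illustration of what a multi-aspect ensemble could look like, the sketch below combines verdicts from several aspect-specific judges by majority vote. This is only a sketch under stated assumptions: the aspect names, the "A"/"B"/"tie" labels, and the choice of majority voting are hypothetical and are not taken from this page.

```python
from collections import Counter


def majority_vote(verdicts):
    """Combine per-aspect verdicts ("A", "B", or "tie") into one decision.

    `verdicts` maps a hypothetical aspect name to that judge's verdict.
    If the top two verdicts receive the same number of votes, return "tie".
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    ranked = Counter(verdicts.values()).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


# Hypothetical aspect names, used for illustration only.
example = {"lexical_content": "A", "paralinguistics": "B", "speech_quality": "A"}
decision = majority_vote(example)
```

Here `decision` is "A" because two of the three illustrative judges prefer response A.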

Findings

01. AudioJudge achieves up to 0.91 Spearman correlation with human preferences in system-level preference simulation.

02. Audio concatenation combined with in-context learning improves performance on both audio characteristic detection and human preference simulation.

03. A multi-aspect ensemble improves general-purpose audio evaluation.
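For readers unfamiliar with the metric in the first finding, Spearman correlation is the Pearson correlation computed on ranks. The sketch below is a minimal pure-Python version; the system scores in the example are made up for illustration and are not results from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up per-system scores: one value per speech system.
human_scores = [0.72, 0.55, 0.61, 0.30, 0.48]
judge_scores = [0.70, 0.50, 0.66, 0.25, 0.52]
rho = spearman(human_scores, judge_scores)
```

With these illustrative numbers the two rankings differ only by one swap of adjacent systems, so `rho` comes out at 0.9.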

Abstract

Current speech evaluation suffers from two critical limitations: the need and difficulty of designing specialized systems targeting individual audio characteristics, and poor correlation between automatic evaluation methods and human preferences. This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We systematically explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification and speech quality, and system-level human preference simulation for automated benchmarking. We investigate different prompt engineering strategies, finding that audio concatenation combined with in-context learning significantly improves performance across both audio characteristic detection and human preference…
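To make the idea of audio concatenation concrete, the sketch below joins several mono clips into one waveform with a short silence between them, so that in-context examples and the clips under test could be passed to a model as a single audio input. This is a sketch under assumptions: the function name, the list-of-floats representation, the clip order, and the 0.5-second gap are all hypothetical and are not specified on this page.

```python
def concatenate_clips(clips, sample_rate, gap_seconds=0.5):
    """Join mono clips (lists of float samples) with silence between them.

    Assumes every clip already uses the same sample rate. The gap length
    is an illustrative default, not a value from the paper.
    """
    gap = [0.0] * int(sample_rate * gap_seconds)
    joined = []
    for index, clip in enumerate(clips):
        if index > 0:
            joined.extend(gap)  # silence separates consecutive clips
        joined.extend(clip)
    return joined


# Toy clips at a toy sample rate of 4 Hz, purely for illustration.
example_clip = [0.1, 0.1, 0.1, 0.1]
test_clip = [0.2, 0.2]
combined = concatenate_clips([example_clip, test_clip], sample_rate=4)
```

In practice the clips would be real waveforms resampled to a common rate, and the text prompt would state which segment is which.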

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis