AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation
Potsawee Manakul, Woody Haosheng Gan, Michael J. Ryan, Ali Sartaz Khan, Warit Sirichotedumrong, Kunat Pipatanakul, William Held, Diyi Yang

TL;DR
This paper investigates the use of Large Audio Models as a unified speech evaluation tool, demonstrating improved performance in audio characteristic detection and human preference simulation, with high correlation to human judgments.
Contribution
It introduces AudioJudge, a systematic framework leveraging large audio models with prompt engineering and multi-aspect ensemble methods for comprehensive speech evaluation.
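As a rough illustration of what a multi-aspect ensemble could look like, the sketch below combines per-aspect pairwise verdicts by majority vote. The summary here does not specify the aspects or the combination rule, so the aspect names, the `majority_vote` helper, and the tie handling are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from typing import Mapping


def majority_vote(verdicts: Mapping[str, str]) -> str:
    """Combine per-aspect verdicts ("A" or "B") into one decision.

    Hypothetical helper: returns the most common verdict, or "tie"
    when the two most common verdicts have the same count.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    ranked = Counter(verdicts.values()).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


# Assumed aspect names; each value stands in for one judge call's output.
example_verdicts = {
    "lexical_content": "A",
    "speech_quality": "B",
    "paralinguistics": "A",
}
decision = majority_vote(example_verdicts)
```

In a real setup each verdict would come from a separate prompt to the audio model focused on a single aspect; here they are hard-coded so the voting step can be read in isolation.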
Findings
AudioJudge achieves up to 0.91 Spearman correlation with human preferences.
Audio concatenation combined with in-context learning significantly improves performance on both audio characteristic detection and human preference simulation.
Multi-aspect ensemble improves general-purpose audio evaluation.
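To make the Spearman figure above concrete, here is a minimal sketch of how a system-level rank correlation between judge scores and human scores can be computed. The function names and the five example scores are invented for illustration and are not data from the paper; ties receive average ranks, and the coefficient is the Pearson correlation of the ranks.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Made-up per-system scores (one entry per speech system), for illustration only.
judge_scores = [0.62, 0.48, 0.71, 0.35, 0.55]
human_scores = [0.60, 0.50, 0.66, 0.40, 0.45]
rho = spearman(judge_scores, human_scores)
```

A value near 1 means the judge orders the systems almost the same way humans do, which is the property a system-level benchmark cares about, more than agreement on the raw scores.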
Abstract
Current speech evaluation suffers from two critical limitations: the need and difficulty of designing specialized systems targeting individual audio characteristics, and poor correlation between automatic evaluation methods and human preferences. This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We systematically explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification and speech quality, and system-level human preference simulation for automated benchmarking. We investigate different prompt engineering strategies, finding that audio concatenation combined with in-context learning significantly improves performance across both audio characteristic detection and human preference…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
