Hearing Between the Lines: Unlocking the Reasoning Power of LLMs for Speech Evaluation
Arjun Chandra, Kevin Miller, Venkatesh Ravichandran, Constantinos Papayiannis, Venkatesh Saligrama

TL;DR
This paper introduces TRACE, a cost-effective framework that enables large language models to evaluate speech quality by reasoning over audio cues through textual representations, improving alignment with human judgments.
Contribution
The paper presents TRACE, a novel method allowing LLMs to assess speech by reasoning over audio cues via textual blueprints, reducing reliance on expensive audio models.
Findings
TRACE outperforms ALMs and transcript-only LLM judges in agreement with human raters.
The framework is more cost-effective than traditional audio-based evaluation methods.
HCoT annotations improve diagnostic capability of speech evaluation benchmarks.
Abstract
Large Language Model (LLM) judges exhibit strong reasoning capabilities but are limited to textual content. This leaves current automatic Speech-to-Speech (S2S) evaluation methods reliant on opaque and expensive Audio Language Models (ALMs). In this work, we propose TRACE (Textual Reasoning over Audio Cues for Evaluation), a novel framework that enables LLM judges to reason over audio cues to achieve cost-efficient and human-aligned S2S evaluation. To demonstrate the strength of the framework, we first introduce a Human Chain-of-Thought (HCoT) annotation protocol to improve the diagnostic capability of existing judge benchmarks by separating evaluation into explicit dimensions: content (C), voice quality (VQ), and paralinguistics (P). Using this data, TRACE constructs a textual blueprint of inexpensive audio signals and prompts an LLM to render dimension-wise judgments, fusing them into…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders
