AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering

Chun-Yi Kuan; Kai-Wei Chang; Hung-yi Lee

arXiv:2601.14728·eess.AS·January 22, 2026

TL;DR

AQAScore evaluates semantic alignment in text-to-audio generation by posing audio question answering queries to audio-aware large language models (ALLMs), and it correlates better with human judgments than existing similarity-based metrics.

Contribution

The paper presents AQAScore, a new evaluation method that reformulates semantic alignment assessment as a probabilistic verification task using audio-aware large language models.

Findings

01. AQAScore correlates better with human judgments than existing metrics.

02. AQAScore effectively captures subtle semantic inconsistencies.

03. AQAScore scales with the capabilities of the underlying audio-aware large language models (ALLMs).
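
As an illustration of what "correlates with human judgments" can mean in practice, the sketch below computes a Spearman rank correlation between human ratings and the scores of two metrics. This is only a minimal example under stated assumptions: this page does not say which correlation coefficient the paper reports, the names `metric_a` and `metric_b` are hypothetical, and all numbers are made-up toy values, not results from the paper.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data (assumed for illustration only).
human_ratings = [1, 2, 3, 4, 5]
metric_a = [-3.0, -2.5, -1.0, -0.8, -0.1]  # orders clips exactly like the humans
metric_b = [0.2, 0.1, 0.4, 0.3, 0.5]       # swaps two adjacent pairs

rho_a = spearman(human_ratings, metric_a)
rho_b = spearman(human_ratings, metric_b)
```

In this toy setup, `metric_a` would be described as correlating better with human judgments than `metric_b`, because its ranking of the clips matches the human ranking more closely.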

Abstract

Although text-to-audio generation has made remarkable progress in realism and diversity, the development of evaluation metrics has not kept pace. Widely adopted approaches, typically based on embedding similarity such as CLAPScore, effectively measure general relevance but remain limited in fine-grained semantic alignment and compositional reasoning. To address this, we introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models (ALLMs). AQAScore reformulates assessment as a probabilistic semantic verification task; rather than relying on open-ended text generation, it estimates alignment by computing the exact log-probability of a "Yes" answer to targeted semantic queries. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks.…
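
The scoring idea in the abstract, reading off the log-probability of a "Yes" answer instead of parsing generated text, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the query template in `build_query`, the three-token toy vocabulary, and the logit values are all assumptions, and a real ALLM would produce next-token logits conditioned on the audio clip and the query. The sketch also assumes the probability is normalized over the whole vocabulary, which this page does not specify.

```python
import math

def build_query(caption):
    """Hypothetical yes/no query template (assumed, not taken from the paper)."""
    return f'Does the audio match the description "{caption}"? Answer Yes or No.'

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def yes_log_prob(next_token_logits, yes_token_id):
    """Log-probability that the next token is the 'Yes' token."""
    return log_softmax(next_token_logits)[yes_token_id]

# Toy stand-in for an ALLM's next-token logits (assumed values).
toy_vocab = ["Yes", "No", "Maybe"]
toy_logits = [2.0, 0.5, -1.0]

query = build_query("a dog barks, then a car passes by")
score = yes_log_prob(toy_logits, toy_vocab.index("Yes"))
```

A higher (less negative) `score` would indicate that the model considers the audio more consistent with the text. Because the score is read directly from the output distribution, it is continuous and does not depend on how a sampled answer happens to be worded.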

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech Recognition and Synthesis · Music and Audio Processing