AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering
Chun-Yi Kuan, Kai-Wei Chang, Hung-yi Lee

TL;DR
AQAScore is an audio question answering framework that uses audio-aware large language models to evaluate semantic alignment in text-to-audio generation. It correlates more strongly with human judgments than existing similarity-based metrics.
Contribution
The paper presents AQAScore, a new evaluation method that reformulates semantic alignment assessment as a probabilistic verification task using audio-aware large language models.
Findings
AQAScore correlates more strongly with human judgments than existing similarity-based metrics.
It effectively captures subtle semantic inconsistencies.
AQAScore scales with the capabilities of the underlying audio-aware large language models (ALLMs).
Abstract
Although text-to-audio generation has made remarkable progress in realism and diversity, the development of evaluation metrics has not kept pace. Widely-adopted approaches, typically based on embedding similarity like CLAPScore, effectively measure general relevance but remain limited in fine-grained semantic alignment and compositional reasoning. To address this, we introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models (ALLMs). AQAScore reformulates assessment as a probabilistic semantic verification task; rather than relying on open-ended text generation, it estimates alignment by computing the exact log-probability of a "Yes" answer to targeted semantic queries. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks.…
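To illustrate the scoring idea described in the abstract, the following minimal Python sketch computes the log-probability of a "Yes" answer from a vector of first-answer-token logits. The toy vocabulary, the logit values, and the function names `log_softmax` and `yes_log_prob` are assumptions made for illustration. The paper's actual prompts, tokenization, and model interface are not given in this summary, so no real ALLM is called here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def yes_log_prob(next_token_logits, vocab, yes_token="Yes"):
    """Return log P(yes_token) for the first answer token.

    `next_token_logits` stands in for what an ALLM would produce after
    being given an audio clip and a yes/no question about the caption
    (hypothetical interface, not the paper's API).
    """
    log_probs = log_softmax(next_token_logits)
    return log_probs[vocab.index(yes_token)]


# Toy stand-in values (made up for illustration only).
vocab = ["Yes", "No", "Maybe"]
logits = [2.0, 0.5, -1.0]

score = yes_log_prob(logits, vocab)  # log-probability, always <= 0
prob = math.exp(score)               # corresponding probability in (0, 1]
```

Under these assumptions, a higher `score` means the model assigns more probability to "Yes", which is read as stronger alignment between the audio and the text being verified. Reading the score off the logits avoids parsing free-form generated text.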
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech Recognition and Synthesis · Music and Audio Processing
