AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering
Chun-Yi Kuan, Hung-yi Lee

TL;DR
AQUA-Bench introduces a comprehensive benchmark for assessing the ability of audio question answering models to detect unanswerable questions, addressing a critical gap in current evaluation methods.
Contribution
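AQUA-Bench systematically evaluates unanswerability in audio question answering through three scenarios (Absent Answer Detection, Incompatible Answer Set Detection, and Incompatible Audio Question Detection), with the aim of promoting more robust and trustworthy audio-language systems.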
Findings
Models perform well on answerable questions but struggle with unanswerable cases.
Current benchmarks overlook the challenge of unanswerable questions in audio QA.
AQUA-Bench provides a rigorous measure of model reliability in real-world settings.
Abstract
Recent advances in audio-aware large language models have shown strong performance on audio question answering. However, existing benchmarks mainly cover answerable questions and overlook the challenge of unanswerable ones, where no reliable answer can be inferred from the audio. Such cases are common in real-world settings, where questions may be misleading, ill-posed, or incompatible with the information. To address this gap, we present AQUA-Bench, a benchmark for Audio Question Unanswerability Assessment. It systematically evaluates three scenarios: Absent Answer Detection (the correct option is missing), Incompatible Answer Set Detection (choices are categorically mismatched with the question), and Incompatible Audio Question Detection (the question is irrelevant or lacks sufficient grounding in the audio). By assessing these cases, AQUA-Bench offers a rigorous measure of model…
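To make the first scenario concrete, the following is a minimal sketch of how an Absent Answer Detection item could be derived from an ordinary multiple-choice item, and how accuracy could be reported separately for answerable and unanswerable questions. It is an illustration only, not the authors' construction or scoring procedure: the `MCQItem` class, the `make_absent_answer` and `accuracy_by_split` helpers, the convention that a prediction of `None` means "the model abstained", and the example clip and question are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class MCQItem:
    """Hypothetical multiple-choice audio QA item (not the benchmark's schema)."""
    audio_id: str
    question: str
    options: List[str]
    answer: Optional[int]  # index into options, or None when unanswerable


def make_absent_answer(item: MCQItem) -> MCQItem:
    """Drop the correct option so that none of the listed choices is right."""
    if item.answer is None:
        raise ValueError("item is already unanswerable")
    kept = [opt for i, opt in enumerate(item.options) if i != item.answer]
    return replace(item, options=kept, answer=None)


def is_correct(item: MCQItem, prediction: Optional[int]) -> bool:
    """Assumed convention: a prediction of None means the model abstained."""
    return prediction == item.answer


def accuracy_by_split(
    items: List[MCQItem], predictions: List[Optional[int]]
) -> Dict[str, Optional[float]]:
    """Report accuracy separately for answerable and unanswerable items."""
    counts = {"answerable": [0, 0], "unanswerable": [0, 0]}  # [correct, total]
    for item, pred in zip(items, predictions):
        key = "unanswerable" if item.answer is None else "answerable"
        counts[key][0] += int(is_correct(item, pred))
        counts[key][1] += 1
    return {k: (c / n if n else None) for k, (c, n) in counts.items()}


# Made-up example: the model picks option 0 for both the original and the variant.
original = MCQItem("clip_001", "Which animal is heard?", ["dog", "cat", "bird"], 0)
variant = make_absent_answer(original)
scores = accuracy_by_split([original, variant], [0, 0])
```

In this toy run the model answers the original item correctly but also picks an option on the variant instead of abstaining, so the answerable split scores 1.0 and the unanswerable split scores 0.0. Reporting the two splits separately is what exposes this kind of gap; a single pooled accuracy would hide it.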
