AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
Tasnim Kabir, Dmytro Kurdydyk, Aadi Palnitkar, Liam Dorn, Ahmed Haj Ahmed, Jordan Lee Boyd-Graber

TL;DR
AUDITA introduces a challenging, real-world audio QA dataset designed to evaluate genuine auditory reasoning, revealing significant gaps in current models' capabilities.
Contribution
The paper presents AUDITA, a novel large-scale dataset with human-authored trivia questions that challenge models to perform robust audio reasoning beyond surface cues.
Findings
Human accuracy is 32.13%, indicating the task's difficulty.
State-of-the-art models achieve accuracy below 8.86%, highlighting current limitations.
IRT analysis exposes systematic deficiencies in models and data.
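For readers unfamiliar with item response theory (IRT), here is a minimal sketch of the idea. The summary does not say which IRT variant the paper uses, so the two-parameter logistic (2PL) form below is an assumption, and the function name and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL IRT model (assumed variant, not confirmed by the source).

    theta: ability of the responder (a human or a model)
    a:     discrimination of the question
    b:     difficulty of the question
    Returns the modeled probability of a correct answer.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical abilities and item parameters, not taken from the paper.
strong = p_correct(theta=1.0, a=1.5, b=0.5)
weak = p_correct(theta=-1.0, a=1.5, b=0.5)
print(f"strong responder: {strong:.3f}, weak responder: {weak:.3f}")
```

Fitting such a model to observed answers gives one ability estimate per responder and one difficulty and discrimination estimate per question. Questions with low or negative estimated discrimination are one way an analysis of this kind can flag problems in the data, not only in the models.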
Abstract
Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning. Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions grounded in real-world audio, designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies, using probing queries that cannot be answered from isolated text or sound cues alone. Human average accuracy of 32.13% shows both the challenge of the task while…
