AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

Tasnim Kabir; Dmytro Kurdydyk; Aadi Palnitkar; Liam Dorn; Ahmed Haj Ahmed; Jordan Lee Boyd-Graber

arXiv:2604.21766·cs.CL·April 24, 2026

TL;DR

AUDITA introduces a challenging, real-world audio QA dataset designed to evaluate genuine auditory reasoning, revealing significant gaps in current models' capabilities.

Contribution

The paper presents AUDITA, a novel large-scale dataset with human-authored trivia questions that challenge models to perform robust audio reasoning beyond surface cues.

Findings

01 Average human accuracy is 32.13%, indicating the task's difficulty.

02 State-of-the-art models score below 8.86% accuracy, highlighting current limitations.

03 IRT analysis exposes systematic deficiencies in models and data.

Abstract

Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning. Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions grounded in real-world audio, designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies, using probing queries that cannot be answered from isolated text or sound cues alone. Human average accuracy of 32.13% shows both the challenge of the task while…
