HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
Feiyu Zhao, Yiming Chen, Wenhuan Lu, Daipeng Zhang, Xianghu Yue, Jianguo Wei

TL;DR
HalluAudio is a large-scale benchmark designed to evaluate hallucination detection in large audio-language models across diverse audio tasks, revealing key deficiencies in current models.
Contribution
This paper introduces the first comprehensive benchmark for hallucination detection in large audio-language models (LALMs). It covers speech, sound, and music, and provides detailed evaluation protocols and extensive model comparisons.
Findings
Models show significant deficiencies in acoustic grounding.
Temporal reasoning and music attribute understanding are weak.
The benchmark reveals the need for more robust LALMs.
Abstract
Large Audio-Language Models (LALMs) have recently achieved strong performance across various audio-centric tasks. However, hallucination, where models generate responses that are semantically incorrect or acoustically unsupported, remains largely underexplored in the audio domain. Existing hallucination benchmarks mainly focus on text or vision, while the few audio-oriented studies are limited in scale, modality coverage, and diagnostic depth. We therefore introduce HalluAudio, the first large-scale benchmark for evaluating hallucinations across speech, environmental sound, and music. HalluAudio comprises over 5K human-verified QA pairs and spans diverse task types, including binary judgments, multi-choice reasoning, attribute verification, and open-ended QA. To systematically induce hallucinations, we design adversarial prompts and mixed-audio conditions. Beyond accuracy, our…
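As an illustration of how a benchmark organized this way might be scored, the sketch below computes plain accuracy broken down by audio domain or task type. It is not the authors' evaluation protocol: the `QAItem` record, its field names, the label strings, and the case-insensitive exact-match rule are all assumptions made for this example, and exact match would only suit closed-form items such as binary or multi-choice questions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QAItem:
    """Hypothetical record for one QA pair; the field names are assumptions."""
    domain: str      # e.g. "speech", "sound", or "music"
    task: str        # e.g. "binary" or "multi_choice"
    answer: str      # reference answer
    prediction: str  # model output, assumed already reduced to a short label


def accuracy_by_group(items, key):
    """Return {group: accuracy}, grouping items by the attribute named `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        group = getattr(item, key)
        total[group] += 1
        # Assumed scoring rule: case-insensitive exact match on the label.
        if item.prediction.strip().lower() == item.answer.strip().lower():
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


if __name__ == "__main__":
    # Toy items invented for this example; they are not benchmark data.
    items = [
        QAItem("speech", "binary", "no", "yes"),
        QAItem("speech", "binary", "yes", "Yes"),
        QAItem("music", "multi_choice", "B", "b"),
        QAItem("sound", "binary", "no", "no"),
    ]
    print(accuracy_by_group(items, "domain"))
    print(accuracy_by_group(items, "task"))
```

Grouping by `domain` or by `task` separates weaknesses that a single aggregate accuracy figure would hide, for example a model that does well on speech but poorly on music attributes.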
