AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives

Yanxi Chen; Wenhui Zhu; Xiwen Chen; Zhipeng Wang; Xin Li; Peijie Qiu; Hao Wang; Xuanzhao Dong; Yujian Xiong; Anderson Schneider; Yuriy Nevmyvaka; Yalin Wang

arXiv:2512.24052 · cs.SD · January 6, 2026

TL;DR

This paper introduces the AHA framework to reduce hallucinations in large audio-language models by using counterfactual hard negatives, improving temporal reasoning and grounding accuracy.

Contribution

The paper presents a novel training pipeline with counterfactual hard negative mining and a diagnostic benchmark for temporal reasoning in audio-language models.

Findings

01 13.7% improvement on AHA-Eval

02 Gains on public benchmarks MMAU-Test and MMAR

03 Outperforms latest SOTA methods

Abstract

Although Large Audio-Language Models (LALMs) deliver state-of-the-art (SOTA) performance, they frequently suffer from hallucinations, e.g. generating text not grounded in the audio input. We analyze these grounding failures and identify a distinct taxonomy: Event Omission, False Event Identity, Temporal Relation Error, and Quantitative Temporal Error. To address this, we introduce the AHA (Audio Hallucination Alignment) framework. By leveraging counterfactual hard negative mining, our pipeline constructs a high-quality preference dataset that forces models to distinguish strict acoustic evidence from linguistically plausible fabrications. Additionally, we establish AHA-Eval, a diagnostic benchmark designed to rigorously test these fine-grained temporal reasoning capabilities. We apply this data to align Qwen2.5-Omni. The resulting model, Qwen-Audio-AHA, achieves a 13.7% improvement on…
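To make the idea of counterfactual hard negatives concrete, the following is a minimal sketch, not the authors' pipeline. The abstract does not say how negatives are constructed, so everything here is an assumption for illustration: the `Event` record, the `describe`, `make_negative`, and `build_pairs` helpers, the specific perturbation chosen for each of the four hallucination types, and the `prompt`/`chosen`/`rejected` record layout (a common convention for preference data). The sketch starts from a ground-truth event timeline, renders it as the preferred answer, and derives one plausible-sounding but ungrounded answer per taxonomy category.

```python
from dataclasses import dataclass, replace
from enum import Enum


class HallucinationType(Enum):
    # The four categories named in the abstract.
    EVENT_OMISSION = "event_omission"
    FALSE_EVENT_IDENTITY = "false_event_identity"
    TEMPORAL_RELATION_ERROR = "temporal_relation_error"
    QUANTITATIVE_TEMPORAL_ERROR = "quantitative_temporal_error"


@dataclass(frozen=True)
class Event:
    # Hypothetical annotation record: a labeled sound event with times in seconds.
    label: str
    start: float
    end: float


def describe(events):
    """Render a timeline as a caption listing events in onset order with durations."""
    ordered = sorted(events, key=lambda e: e.start)
    return ", then ".join(f"{e.label} for {e.end - e.start:.1f}s" for e in ordered)


def make_negative(events, kind, distractor="dog barking"):
    """Perturb the true timeline in exactly one way and render the false caption."""
    if len(events) < 2:
        raise ValueError("need at least two events to build every negative type")
    evs = sorted(events, key=lambda e: e.start)
    first, second, rest = evs[0], evs[1], evs[2:]
    if kind is HallucinationType.EVENT_OMISSION:
        evs = evs[:-1]  # silently drop the last event
    elif kind is HallucinationType.FALSE_EVENT_IDENTITY:
        evs = [replace(first, label=distractor), second, *rest]  # wrong label, same timing
    elif kind is HallucinationType.TEMPORAL_RELATION_ERROR:
        # Swap the labels of the first two time slots, reversing their order.
        evs = [replace(first, label=second.label), replace(second, label=first.label), *rest]
    elif kind is HallucinationType.QUANTITATIVE_TEMPORAL_ERROR:
        # Double the stated duration of the first event.
        evs = [replace(first, end=first.start + 2 * (first.end - first.start)), second, *rest]
    return describe(evs)


def build_pairs(events, prompt="Describe the audio events in order."):
    """Return one preference record per hallucination type."""
    chosen = describe(events)
    return [
        {"prompt": prompt, "chosen": chosen,
         "rejected": make_negative(events, kind), "type": kind.value}
        for kind in HallucinationType
    ]
```

For a timeline with a one-second door slam followed by three seconds of speech, `build_pairs` yields four records that share the same preferred caption and differ from it in a single detail each, which is what makes the rejected answers "hard": they stay linguistically plausible while contradicting the acoustic evidence.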

Code & Models

Models

ASU-GSL/Qwen-Audio-AHA
model · 53 dl

Datasets

ASU-GSL/AHA
dataset · 33 dl

Taxonomy

Topics: Music and Audio Processing · Adversarial Robustness in Machine Learning · Emotion and Mood Recognition