Reducing Object Hallucination in Large Audio-Language Models via Audio-Aware Decoding
Tzu-wen Hsu, Ke-Han Lu, Cheng-Han Chiang, Hung-yi Lee

TL;DR
This paper introduces Audio-Aware Decoding (AAD), a lightweight inference-time strategy for Large Audio-Language Models (LALMs) that reduces object hallucination by contrasting token predictions made with and without the audio context. It improves F1 on object hallucination datasets and accuracy on Clotho-AQA.
Contribution
The paper proposes a novel contrastive decoding method, Audio-Aware Decoding, that mitigates hallucinations in LALMs during inference, enhancing their factual accuracy without retraining.
Findings
AAD improves F1 scores by 0.046 to 0.428 on object hallucination datasets across three LALMs.
AAD increases accuracy on Clotho-AQA by 5.4% to 10.3%.
Thorough ablation studies validate the effectiveness of AAD components.
Abstract
Large Audio-Language Models (LALMs) take audio and text as input and answer questions about the audio. While prior LALMs have shown strong performance on standard benchmarks, there is alarming evidence that LALMs can hallucinate about what is present in the audio. To mitigate this hallucination, we introduce Audio-Aware Decoding (AAD), a lightweight inference-time strategy that uses contrastive decoding to compare the token prediction logits with and without the audio context. Through contrastive decoding, AAD promotes the tokens whose probability increases when the audio is present. We conduct experiments on object hallucination datasets with three LALMs and show that AAD improves the F1 score by 0.046 to 0.428. We also show that AAD can improve the accuracy on general audio QA datasets like Clotho-AQA by 5.4% to 10.3%. We conduct thorough ablation studies to…
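The contrast described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the combination rule below, (1 + alpha) * log p(with audio) - alpha * log p(without audio), is an assumption borrowed from the common form of contrastive decoding and may differ from the paper's. The function names, the `alpha` parameter, the three-word vocabulary, and the toy logits are all hypothetical and chosen only for illustration.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))


def audio_aware_scores(logits_with_audio, logits_without_audio, alpha=1.0):
    """Contrast next-token predictions made with and without the audio.

    ASSUMPTION: this is the generic contrastive-decoding rule, not a formula
    taken from the paper. Tokens whose log-probability rises when the audio
    is present are promoted; tokens favored by the text prior alone are
    penalized. alpha=0 reduces to ordinary decoding with audio.
    """
    lp_with = log_softmax(np.asarray(logits_with_audio, dtype=float))
    lp_without = log_softmax(np.asarray(logits_without_audio, dtype=float))
    return (1.0 + alpha) * lp_with - alpha * lp_without


# Toy example (invented numbers): the text-only prior strongly favors "yes",
# and the audio shifts probability toward "no" without flipping the argmax.
vocab = ["yes", "no", "maybe"]
logits_with_audio = np.array([2.0, 1.8, 0.0])
logits_without_audio = np.array([3.0, 0.5, 0.0])

plain_choice = vocab[int(np.argmax(logits_with_audio))]
scores = audio_aware_scores(logits_with_audio, logits_without_audio, alpha=1.0)
aad_choice = vocab[int(np.argmax(scores))]

print("greedy with audio only:", plain_choice)
print("contrastive choice:    ", aad_choice)
```

In this toy case ordinary greedy decoding still picks "yes", while the contrastive score picks "no", because "no" is the token whose probability increases most once the audio is present. In a real LALM the two logit vectors would come from two forward passes per decoding step, one with the audio and one without, which is why the method needs no retraining.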
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
