Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples

Chun-Yi Kuan; Hung-yi Lee

arXiv:2505.14518 · eess.AS · July 2, 2025



TL;DR

This paper introduces LISTEN, a training method for audio-aware large language models (ALLMs) that reduces hallucinations of non-existent sound events by training on synthesized negative samples, without modifying the parameters of the backbone LLM.

Contribution

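LISTEN is a contrastive-like training approach that improves ALLMs' ability to distinguish present sounds from absent ones, using data synthesized from the backbone LLM. It leaves the LLM parameters unchanged, integrates audio representations through a lightweight adapter, and is more efficient in data and computation than prior approaches.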

Findings

01 Effectively reduces hallucinations of non-existent sounds.

02 Maintains high performance on audio question and reasoning benchmarks.

03 More efficient in data and computation than prior methods.

Abstract

Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. However, these models often hallucinate non-existent sound events, reducing their reliability in real-world applications. To address this, we propose LISTEN (Learning to Identify Sounds Through Extended Negative Samples), a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds using synthesized data from the backbone LLM. Unlike prior approaches, our method requires no modification to LLM parameters and efficiently integrates audio representations via a lightweight adapter. Experiments show that LISTEN effectively mitigates hallucinations while maintaining impressive performance on existing audio question and reasoning benchmarks. At the same time, it is more efficient in both data and computation.
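
To make the idea of paired present/absent supervision concrete, the following is a minimal sketch of how negative question-answer samples could be constructed for one audio clip. It is an illustration under stated assumptions, not the authors' pipeline: the abstract says the data are synthesized from the backbone LLM, whereas this sketch substitutes fixed text templates as a stand-in. The names `QAPair`, `build_pairs`, the question wording, and the example sound vocabulary are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class QAPair:
    audio_id: str
    question: str
    answer: str
    is_negative: bool  # True when the question asks about an absent sound


def build_pairs(audio_id, present, vocabulary, n_negatives=2, seed=0):
    """Return yes/no QA pairs for one clip: one per present sound,
    plus up to n_negatives about sounds that are not annotated.

    Assumption: templates stand in for LLM-synthesized text.
    """
    rng = random.Random(seed)
    present_set = set(present)
    # Candidate negatives: vocabulary sounds not annotated for this clip.
    absent = sorted(set(vocabulary) - present_set)
    negatives = rng.sample(absent, min(n_negatives, len(absent)))

    pairs = []
    for sound in sorted(present_set):
        pairs.append(QAPair(
            audio_id,
            f"Is there a sound of {sound} in the audio?",
            f"Yes, the audio contains {sound}.",
            False,
        ))
    for sound in negatives:
        pairs.append(QAPair(
            audio_id,
            f"Is there a sound of {sound} in the audio?",
            f"No, the audio does not contain {sound}.",
            True,
        ))
    return pairs


if __name__ == "__main__":
    vocab = ["dog barking", "siren", "rain", "speech"]
    for pair in build_pairs("clip_001", ["dog barking"], vocab):
        print(pair.is_negative, "|", pair.question, "|", pair.answer)
```

In a setup like the one the abstract describes, pairs of this kind would be used to train only the lightweight adapter while the LLM parameters stay frozen; that training loop is omitted here because the source does not specify it.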


Taxonomy

Topics: Music and Audio Processing · Emotion and Mood Recognition · Explainable Artificial Intelligence (XAI)