Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with   Multi-Task Assessment and Stepwise Audio Reasoning

Chun-Yi Kuan; Hung-yi Lee

arXiv:2410.16130·eess.AS·January 3, 2025

Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning

Chun-Yi Kuan, Hung-yi Lee

PDF

Open Access 1 Repo

TL;DR

This paper evaluates large audio-language models' understanding of audio through three tasks, revealing limitations and proposing a multi-turn reasoning approach to improve their ability to recognize sound events, order, and sources.

Contribution

It introduces three systematic audio comprehension tasks and a multi-turn reasoning method to enhance model accuracy in sound event recognition and attribution.

Findings

01

Models show limitations in recognizing sound events and sources.

02

Multi-turn reasoning improves task performance.

03

Evaluation highlights areas needing better model understanding.

Abstract

Recent advancements in large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information. However, these models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources, which undermine their reliability and real-world application. To systematically evaluate these issues, we propose three distinct tasks: object existence, temporal order, and object attribute within audio. These tasks assess the models' comprehension of critical audio information aspects. Our experimental results reveal limitations in these fundamental tasks, underscoring the need for better models in recognizing specific sound events, determining event sequences, and identifying sound sources. To improve performance in these areas, we introduce a…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

kuan2jiu99/audio-hallucination
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsHearing Loss and Rehabilitation · Neuroscience and Music Perception · Music and Audio Processing