Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning
Chun-Yi Kuan, Hung-yi Lee

TL;DR
This paper evaluates large audio-language models' understanding of audio through three tasks, revealing limitations and proposing a multi-turn reasoning approach to improve their ability to recognize sound events, order, and sources.
Contribution
It introduces three systematic audio comprehension tasks and a multi-turn reasoning method to enhance model accuracy in sound event recognition and attribution.
Findings
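Current LALMs hallucinate non-existent sound events, misidentify the order of events, and attribute sounds to the wrong sources.
Multi-turn reasoning improves performance on these tasks.
The results underscore the need for models that handle these fundamental audio comprehension tasks more reliably.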
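To make the three-task setup concrete, here is a minimal sketch of how probes for object existence, temporal order, and object attribute could be scored per task. It is an illustration only, not the authors' benchmark code: the yes/no question format, the `Probe` class, the `normalize` and `accuracy_by_task` helpers, the clip identifier, and the stub model are all assumptions introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Probe:
    """One hypothetical yes/no question about an audio clip."""
    task: str       # "object_existence", "temporal_order", or "object_attribute"
    audio_id: str   # placeholder identifier for the clip
    question: str   # question wording is an assumption, not from the paper
    expected: str   # "yes" or "no"


def normalize(answer: str) -> str:
    """Map a free-form reply to 'yes', 'no', or 'unknown' using its first word."""
    words = answer.lower().replace(",", " ").replace(".", " ").split()
    if words and words[0] in ("yes", "no"):
        return words[0]
    return "unknown"


def accuracy_by_task(
    probes: List[Probe], model: Callable[[str, str], str]
) -> Dict[str, float]:
    """Return the fraction of correctly answered probes for each task."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for probe in probes:
        total[probe.task] += 1
        if normalize(model(probe.audio_id, probe.question)) == probe.expected:
            correct[probe.task] += 1
    return {task: correct[task] / total[task] for task in total}


def always_yes(audio_id: str, question: str) -> str:
    """Stub standing in for a model that affirms every question."""
    return "Yes, it is."


example_probes = [
    Probe("object_existence", "clip1", "Is there a dog barking?", "yes"),
    Probe("object_existence", "clip1", "Is there a siren?", "no"),
    Probe("temporal_order", "clip1", "Does the bark come before the door slam?", "yes"),
    Probe("object_attribute", "clip1", "Is the barking produced by a cat?", "no"),
]

if __name__ == "__main__":
    print(accuracy_by_task(example_probes, always_yes))
```

Pairing each positive question with a negative one means a model that affirms everything cannot score well, which is one simple way to surface hallucinated sound events.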
Abstract
Recent advancements in large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information. However, these models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources, which undermine their reliability and real-world application. To systematically evaluate these issues, we propose three distinct tasks: object existence, temporal order, and object attribute within audio. These tasks assess the models' comprehension of critical audio information aspects. Our experimental results reveal limitations in these fundamental tasks, underscoring the need for better models in recognizing specific sound events, determining event sequences, and identifying sound sources. To improve performance in these areas, we introduce a…
Taxonomy
Topics: Hearing Loss and Rehabilitation · Neuroscience and Music Perception · Music and Audio Processing
