TL;DR
This paper investigates integrating Chain-of-Thought reasoning into Large Audio-Language Models to improve their complex reasoning abilities across audio tasks, revealing both benefits and limitations of current methods.
Contribution
First exploration of Chain-of-Thought reasoning in Large Audio-Language Models, analyzing its impact and challenges across auditory perception and understanding tasks.
Findings
CoT methods improve performance on easy and medium tasks
Reasoning chain length positively correlates with accuracy
Hard tasks remain challenging: reasoning chains can confuse the model rather than improve accuracy
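The correlation finding above can be checked on any set of model outputs with a few lines of code. The following is a minimal sketch, not the authors' evaluation code: the record fields (`reasoning`, `correct`), the whitespace-token length proxy, and the toy records are all assumptions made for illustration and do not come from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


def chain_length(reasoning):
    """Length proxy (an assumption): number of whitespace-separated tokens."""
    return len(reasoning.split())


# Hypothetical toy records, NOT results from the paper.
# "correct" is 1 if the final answer matched the reference, else 0.
records = [
    {"reasoning": "A dog barks.", "correct": 0},
    {"reasoning": "There is barking, then a door closes, so someone left.", "correct": 1},
    {"reasoning": "Music is playing.", "correct": 0},
    {"reasoning": "A piano plays a slow melody in a minor key, so the mood is sad.", "correct": 1},
]

lengths = [chain_length(r["reasoning"]) for r in records]
correct = [r["correct"] for r in records]

# With a binary outcome this is the point-biserial correlation.
r = pearson(lengths, correct)
print(f"correlation between chain length and correctness: {r:.3f}")
```

A positive value on real outputs would be consistent with the finding, but it does not by itself show that longer chains cause better answers; task difficulty can influence both chain length and accuracy.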
Abstract
Large Audio-Language Models (LALMs) have demonstrated remarkable performance in tasks involving audio perception and understanding, such as speech recognition and audio captioning. However, their reasoning capabilities, which are critical for solving complex real-world problems, remain underexplored. In this work, we conduct the first exploration of integrating Chain-of-Thought (CoT) reasoning into LALMs to enhance their reasoning ability across auditory modalities. We evaluate representative CoT methods, analyzing their performance on both information extraction and reasoning tasks across the sound, music, and speech domains. Our findings reveal that CoT methods significantly improve performance on easy and medium tasks but encounter challenges with hard tasks, where reasoning chains can confuse the model rather than improve accuracy. Additionally, we identify a positive correlation between…
