A Survey of Audio Reasoning in Multimodal Foundation Models
Zhihan Guo, Wenqian Cui, Guan-Ting Lin, Daxin Tan, Jingyao Li, Qiyong Zheng, Dingdong Wang, Jing Xiong, Han Shi, Jiaya Jia, Irwin King

TL;DR
This paper provides the first comprehensive survey of audio reasoning in multimodal foundation models, discussing challenges, recent advances, and future directions in the field.
Contribution
The survey offers a unified formulation of audio reasoning, reviews recent progress, and discusses emerging paradigms and open challenges for foundation models.
Findings
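Identifies three key obstacles: the scarcity of genuinely audio-grounded reasoning data, shortcut learning and modality hallucination, and the tension between reasoning depth and real-time latency in spoken interaction.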
Reviews architectural and training foundations of audio reasoning models.
Organizes recent advances across multiple audio reasoning tasks.
Abstract
Reasoning has become a defining capability of modern foundation models, yet its development in the audio modality remains limited. Audio poses challenges that are distinct from those of text and vision. It is continuous, temporally dense, and contains linguistic, paralinguistic, and environmental information at multiple time scales. As a result, audio reasoning models must align acoustic signals with the discrete semantic space of large language models, while still preserving fine-grained information needed for reliable inference. Progress is also limited by three major obstacles: the scarcity of genuinely audio-grounded reasoning data, shortcut learning and modality hallucination, and the tension between reasoning depth and real-time latency in spoken interaction. In this paper, we present the first dedicated survey of audio reasoning. We provide a unified formulation that…
