Towards General Auditory Intelligence: Large Multimodal Models for Machine Listening and Speaking

Siyin Wang; Zengrui Jin; Changli Tang; Qiujia Li; Bo Li; Chen Chen; Yuchen Hu; Wenyi Yu; Yixuan Li; Jimin Zhuang; Yudong Yang; Mingqiu Wang; Michael Han; Yifan Ding; Junwen Bai; Tom Ouyang; Shuo-yiin Chang; Xianzhao Chen; Xiaohai Tian; Jun Zhang; Lu Lu; Guangzhi Sun; Zhehuai Chen; Ji Wu; Bowen Zhou; Yuxuan Wang; Tara Sainath; Yonghui Wu; Chao Zhang

arXiv:2511.01299 · eess.AS · November 4, 2025

Open Access 1 Datasets

TL;DR

This paper reviews recent advances in large multimodal models for machine listening and speaking, emphasizing audio comprehension, generation, speech interaction, and audio-visual understanding to move towards general auditory intelligence.

Contribution

It provides a comprehensive survey of integrating audio into large language models, highlighting recent progress, challenges, and future directions for audio-native AGI systems.

Findings

01 LLMs are transforming audio perception and reasoning.

02 Multimodal fusion enhances situational awareness.

03 Current challenges include deep semantic understanding and natural interaction.

Abstract

In the era of large language models (LLMs) and artificial general intelligence (AGI), computer audition must evolve beyond traditional paradigms to fully leverage the capabilities of foundation models, towards more comprehensive understanding, more natural generation and more human-like interaction. Audio, as a modality rich in semantic, emotional, and contextual cues, plays a vital role in achieving naturalistic and embodied machine intelligence. This survey provides a comprehensive review of recent progress in integrating audio into LLMs, with a focus on four key areas: audio comprehension, audio generation, speech-based interaction, and audio-visual understanding. We analyze how LLMs are reshaping audio perception and reasoning, enabling systems to understand sound at a deeper semantic level, generate expressive audio outputs, and engage in human-like spoken interaction. Furthermore,…

Code & Models

Datasets

Kylan12/Synthetic-AI-ML-Dataset
dataset · 42 dl

Taxonomy

Topics: Emotion and Mood Recognition · Multimodal Machine Learning Applications · Music and Audio Processing