Audio-Language Models for Audio-Centric Tasks: A Systematic Survey
Yi Su, Jisheng Bai, Qisheng Xu, Kele Xu, Yong Dou

TL;DR
This systematic survey reviews the architectures, training methods, and applications of Audio-Language Models (ALMs) across audio domains, emphasizes their zero-shot capabilities, and identifies future research directions.
Contribution
The paper provides the first comprehensive survey of ALMs, including a unified taxonomy and analysis of their development, limitations, and future trends in audio-centric tasks.
Findings
ALMs demonstrate strong zero-shot and generalization abilities.
The survey organizes ALM architectures and training objectives into a unified taxonomy.
It identifies current limitations and promising directions for future research.
Abstract
Audio-Language Models (ALMs), trained on paired audio-text data, are designed to process, understand, and reason about audio-centric multimodal content. Unlike traditional supervised approaches that use predefined labels, ALMs leverage natural language supervision to better handle complex real-world audio scenes with multiple overlapping events. Although ALMs demonstrate impressive zero-shot and task generalization capabilities, there is still a notable lack of systematic surveys that comprehensively organize and analyze their development. In this paper, we present the first systematic review of ALMs with three main contributions: (1) comprehensive coverage of ALM works across speech, music, and sound from a general audio perspective; (2) a unified taxonomy of ALM foundations, including model architectures and training objectives; (3) establishment of a research landscape capturing mutual…
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies
