Audio-Language Models for Audio-Centric Tasks: A Systematic Survey

Yi Su; Jisheng Bai; Qisheng Xu; Kele Xu; Yong Dou

arXiv:2501.15177 · cs.SD · March 13, 2026

Open Access

TL;DR

This systematic survey reviews Audio-Language Models (ALMs), covering their architectures, training methods, and applications across audio domains. It emphasizes their zero-shot capabilities and identifies future research directions.

Contribution

The paper provides the first comprehensive survey of ALMs, including a unified taxonomy and analysis of their development, limitations, and future trends in audio-centric tasks.

Findings

01

ALMs demonstrate strong zero-shot and generalization abilities.

02

The survey proposes a unified taxonomy of ALM architectures and training objectives.

03

The survey identifies current limitations and promising future research directions.

Abstract

Audio-Language Models (ALMs), trained on paired audio-text data, are designed to process, understand, and reason about audio-centric multimodal content. Unlike traditional supervised approaches that use predefined labels, ALMs leverage natural language supervision to better handle complex real-world audio scenes with multiple overlapping events. While ALMs demonstrate impressive zero-shot and task generalization capabilities, there is still a notable lack of systematic surveys that comprehensively organize and analyze these developments. In this paper, we present the first systematic review of ALMs with three main contributions: (1) comprehensive coverage of ALM works across speech, music, and sound from a general audio perspective; (2) a unified taxonomy of ALM foundations, including model architectures and training objectives; (3) establishment of a research landscape capturing mutual…

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies