Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

Chih-Kai Yang; Neo S. Ho; Hung-yi Lee

arXiv:2505.15957 · eess.AS · April 28, 2026

TL;DR

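This paper surveys evaluations of large audio-language models (LALMs) and proposes a systematic taxonomy for them, addressing the fragmentation of existing benchmarks and pointing to future research directions. The authors state that, to the best of their knowledge, it is the first survey with this focus.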

Contribution

The paper introduces a systematic taxonomy for LALM evaluation, categorizes existing benchmarks by their objectives, and offers insights and guidelines for the community.

Findings

01

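Proposes four evaluation dimensions for LALMs: General Auditory Awareness and Processing; Knowledge and Reasoning; Dialogue-oriented Ability; and Fairness, Safety, and Trustworthiness.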

02

Highlights challenges and future directions in LALM evaluation.

03

Plans to release a curated collection of relevant papers.

Abstract

With advancements in large audio-language models (LALMs), which enhance large language models (LLMs) with auditory capabilities, these models are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs' performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category and highlight challenges in this field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on…

Code & Models

Repositories

ckyang1124/LALM-Evaluation-Survey (GitHub)

Videos

Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey (Underline)