TL;DR
This paper presents the first comprehensive survey and taxonomy for evaluating large audio-language models, addressing fragmentation in existing benchmarks and guiding future research.
Contribution
It introduces a systematic taxonomy for LALM evaluation, categorizes existing benchmarks, and offers insights and guidelines for the community.
Findings
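Proposes four evaluation dimensions for LALMs: General Auditory Awareness and Processing; Knowledge and Reasoning; Dialogue-oriented Ability; and Fairness, Safety, and Trustworthiness.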
Highlights challenges and future directions in LALM evaluation.
Plans to release a curated collection of relevant papers.
Abstract
With advancements in large audio-language models (LALMs), which enhance large language models (LLMs) with auditory capabilities, these models are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs' performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category and highlight challenges in this field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on…
