IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models

Yiming Gao; Bin Wang; Chengwei Wei; Shuo Sun; AiTi Aw

arXiv:2505.16774 · cs.CL · November 13, 2025

TL;DR

This paper introduces IFEval-Audio, a new benchmark dataset for evaluating instruction-following abilities of audio-based large language models across diverse tasks, addressing a gap in multimodal model assessment.

Contribution

The paper presents IFEval-Audio, the first comprehensive benchmark dataset for assessing instruction-following in audio LLMs, facilitating future research in multimodal instruction understanding.

Findings

1. State-of-the-art audio LLMs show varied performance across instruction types.
2. The dataset reveals specific challenges in following structured audio instructions.
3. Benchmark results highlight areas for improvement in audio multimodal models.

Abstract

Large language models (LLMs) have demonstrated strong instruction-following capabilities in text-based tasks. However, this ability often deteriorates in multimodal models after alignment with non-text modalities such as images or audio. While several recent efforts have investigated instruction-following performance in text and vision-language models, instruction-following in audio-based large language models remains largely unexplored. To bridge this gap, we introduce IFEval-Audio, a novel evaluation dataset designed to assess the ability to follow instructions in an audio LLM. IFEval-Audio contains 280 audio-instruction-answer triples across six diverse dimensions: Content, Capitalization, Symbol, List Structure, Length, and Format. Each example pairs an audio input with a text instruction, requiring the model to generate an output that follows a specified structure. We benchmark…
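To make the dataset description concrete, here is a minimal sketch of how one audio-instruction-answer triple might be represented, together with a few rule-based structure checks. It is illustrative only: the field names, the `Example` class, and the three checker functions are assumptions made for this example, not the paper's schema or its evaluation method. Only the six dimension names come from the abstract.

```python
from dataclasses import dataclass

# The six dimensions named in the abstract.
DIMENSIONS = (
    "Content", "Capitalization", "Symbol",
    "List Structure", "Length", "Format",
)


@dataclass
class Example:
    """One audio-instruction-answer triple (field names are hypothetical)."""
    audio_path: str        # path to the audio input
    instruction: str       # text instruction paired with the audio
    reference_answer: str  # reference output
    dimension: str         # one of DIMENSIONS


def follows_all_caps(output: str) -> bool:
    """Assumed Capitalization rule: output has letters and all are uppercase."""
    return any(c.isalpha() for c in output) and output == output.upper()


def follows_max_words(output: str, limit: int) -> bool:
    """Assumed Length rule: output has at most `limit` whitespace-separated words."""
    return len(output.split()) <= limit


def follows_bullet_list(output: str, n_items: int) -> bool:
    """Assumed List Structure rule: exactly `n_items` non-empty lines, each a bullet."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    return len(lines) == n_items and all(
        ln.startswith(("- ", "* ")) for ln in lines
    )
```

Checks like these only test whether an output has the requested structure; whether the content is correct for the audio would need a separate judgment.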

Code & Models

Repositories

audiollms/audiobench (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis