AudioCapBench: Quick Evaluation on Audio Captioning across Sound, Music, and Speech

Jielin Qiu; Jianguo Zhang; Zixiang Chen; Liangwei Yang; Ming Zhu; Juntao Tan; Haolin Chen; Wenting Zhao; Rithesh Murthy; Roshan Ram; Akshara Prabhakar; Shelby Heinecke; Caiming Xiong; Silvio Savarese; Huan Wang

arXiv:2602.23649 · cs.SD · March 2, 2026

TL;DR

AudioCapBench is a benchmark for evaluating how accurately and completely large multimodal models caption audio across the sound, music, and speech domains, combining reference-based metrics with an LLM-as-Judge framework.

Contribution

This paper introduces AudioCapBench, a new benchmark covering three audio domains (environmental sound, music, and speech) with 1,000 curated samples and multi-metric evaluation, including an LLM-as-Judge assessment framework.

Findings

1. Gemini models generally outperform OpenAI models in overall caption quality.

2. OpenAI models show lower hallucination rates than Gemini models.

3. Models perform best on speech and worst on music captioning.

Abstract

We introduce AudioCapBench, a benchmark for evaluating audio captioning capabilities of large multimodal models. AudioCapBench covers three distinct audio domains (environmental sound, music, and speech) with 1,000 curated evaluation samples drawn from established datasets. We evaluate 13 models across two providers (OpenAI, Google Gemini) using both reference-based metrics (METEOR, BLEU, ROUGE-L) and an LLM-as-Judge framework that scores predictions on three orthogonal dimensions: accuracy (semantic correctness), completeness (coverage of reference content), and hallucination (absence of fabricated content). Our results reveal that Gemini models generally outperform OpenAI models on overall captioning quality, with Gemini 3 Pro achieving the highest overall score (6.00/10), while OpenAI models exhibit lower hallucination rates. All models perform best on…
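
As a rough illustration of ROUGE-L, one of the reference-based metrics named in the abstract, the sketch below scores a predicted caption against a single reference using the longest common subsequence (LCS) of their tokens. It is not the benchmark's own implementation: the function names `lcs_length` and `rouge_l_f1`, the lowercase whitespace tokenization, and the use of a balanced F1 (rather than a recall-weighted F-measure) are all assumptions made for this example.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 between one predicted caption and one reference caption.

    Assumption: captions are lowercased and split on whitespace; a real
    evaluation pipeline may tokenize and weight precision/recall differently.
    """
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)  # share of predicted tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical captions, not taken from the benchmark.
    print(rouge_l_f1("a dog barks loudly", "a dog barks"))
```

Because a score like this rewards only word overlap with the reference, a caption that paraphrases correctly can score low and one that adds fabricated detail can still score reasonably, which is the kind of gap the accuracy, completeness, and hallucination dimensions of the LLM-as-Judge framework are described as addressing.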

Taxonomy

Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Speech Recognition and Synthesis