BRACE: A Benchmark for Robust Audio Caption Quality Evaluation

Tianyu Guo; Hongyu Chen; Hao Liang; Meiyi Qiang; Bohan Zeng; Linzhuang Sun; Bin Cui; Wentao Zhang

arXiv:2512.10403 · cs.SD · December 12, 2025

TL;DR

BRACE is a new benchmark for evaluating the quality of audio captions in reference-free settings, revealing limitations of current models and guiding future research.

Contribution

We introduce BRACE, a comprehensive benchmark with datasets for assessing audio caption alignment and hallucination detection, specifically designed for reference-free evaluation.

Findings

01

CLAPScore achieves an F1-score of only 70.01 on BRACE-Main.

02

The best LALM reaches an F1-score of just 63.19.

03

Current models have significant room for improvement in audio caption evaluation.
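
To make the F1 figures above more concrete, here is a minimal sketch of how a CLAPScore-style metric could be scored on a pairwise caption comparison task. It assumes that the metric is the cosine similarity between an audio embedding and a caption embedding, that the metric "prefers" whichever caption scores higher, and that F1 is computed as a plain binary F1 over those preferences. The function names and the toy vectors are hypothetical; the paper's actual evaluation protocol may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def prefers_first_caption(audio_emb, caption_a_emb, caption_b_emb):
    """True if caption A is closer to the audio than caption B (assumed decision rule)."""
    score_a = cosine_similarity(audio_emb, caption_a_emb)
    score_b = cosine_similarity(audio_emb, caption_b_emb)
    return score_a > score_b


def f1_score(predictions, labels):
    """Binary F1, treating True ("caption A is better") as the positive class."""
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    fn = sum(1 for p, l in zip(predictions, labels) if not p and l)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, hand-made embeddings (hypothetical; a real setup would use a CLAP encoder).
audio = [1.0, 0.0]
caption_a = [0.9, 0.1]   # close to the audio
caption_b = [0.1, 0.9]   # far from the audio

predictions = [
    prefers_first_caption(audio, caption_a, caption_b),  # metric picks A
    prefers_first_caption(audio, caption_b, caption_a),  # metric picks B
]
human_labels = [True, False]  # hypothetical human preferences
print(f1_score(predictions, human_labels))
```

In this sketch the embeddings are supplied directly, so the audio and text encoders can be swapped out without changing the scoring logic.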

Abstract

Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth captions are unavailable. While CLAPScore is currently the most widely used reference-free Audio Caption Evaluation Metric (ACEM), its robustness under diverse conditions has not been systematically validated. To address this gap, we introduce BRACE, a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE is primarily designed for assessing ACEMs, and can also be extended to measure the modality alignment abilities of Large Audio Language Models (LALMs). BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for…

Taxonomy

Topics: Subtitles and Audiovisual Media · Music and Audio Processing · Speech Recognition and Synthesis