BRACE: A Benchmark for Robust Audio Caption Quality Evaluation
Tianyu Guo, Hongyu Chen, Hao Liang, Meiyi Qiang, Bohan Zeng, Linzhuang Sun, Bin Cui, Wentao Zhang

TL;DR
BRACE is a new benchmark for testing how well reference-free metrics and models judge audio caption quality. It reveals limitations of current approaches and points to directions for future research.
Contribution
We introduce BRACE, a comprehensive benchmark with datasets for assessing audio caption alignment and hallucination detection, specifically designed for reference-free evaluation.
Findings
CLAPScore achieves an F1-score of only 70.01 on BRACE-Main.
The best-performing LALM reaches an F1-score of just 63.19.
Current models have significant room for improvement in audio caption evaluation.
Abstract
Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth captions are unavailable. While CLAPScore is currently the most widely used reference-free Audio Caption Evaluation Metric (ACEM), its robustness under diverse conditions has not been systematically validated. To address this gap, we introduce BRACE, a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE is primarily designed for assessing ACEMs, and can also be extended to measure the modality alignment abilities of Large Audio Language Models (LALMs). BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for…
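To make the reference-free setting concrete, the following is a minimal sketch of how a similarity-based metric could be scored on pairwise caption comparison. It is not the BRACE protocol or the CLAPScore implementation: the embeddings are hand-made toy vectors standing in for the outputs of an audio-text encoder, the helper names (`cosine`, `prefer`, `macro_f1`) are hypothetical, and the use of macro-averaged F1 over the two choices is an assumption, since the text above does not say how F1 is averaged.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def prefer(audio_emb, cap_a_emb, cap_b_emb):
    """Pick the caption whose embedding is closer to the audio embedding.

    No reference caption is consulted, which is what makes the metric
    reference-free. Ties go to caption A (an arbitrary choice).
    """
    score_a = cosine(audio_emb, cap_a_emb)
    score_b = cosine(audio_emb, cap_b_emb)
    return "A" if score_a >= score_b else "B"

def f1_for_label(preds, golds, positive):
    """F1 treating `positive` as the positive class."""
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(preds, golds, labels=("A", "B")):
    """Unweighted mean of per-label F1 (assumed averaging scheme)."""
    return sum(f1_for_label(preds, golds, label) for label in labels) / len(labels)

# Toy data: (audio embedding, caption A embedding, caption B embedding, human choice).
# All values are invented for illustration only.
pairs = [
    ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], "A"),
    ([0.0, 1.0], [1.0, 0.0], [0.0, 1.0], "B"),
    ([1.0, 1.0], [1.0, 0.9], [1.0, 0.0], "B"),  # metric and human disagree here
]

preds = [prefer(audio, cap_a, cap_b) for audio, cap_a, cap_b, _ in pairs]
golds = [human for _, _, _, human in pairs]
print(preds, round(macro_f1(preds, golds), 4))
```

In this toy run the metric agrees with the human choice on two of three pairs. A benchmark of this kind reports the same sort of agreement over many audio-caption pairs, which is how a widely used metric can still score well below a perfect F1.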
Taxonomy
Topics: Subtitles and Audiovisual Media · Music and Audio Processing · Speech Recognition and Synthesis
