MACE: Leveraging Audio for Evaluating Audio Captioning Systems
Satvik Dixit, Soham Deshmukh, Bhiksha Raj

TL;DR
MACE is a new evaluation metric for audio captioning that combines audio features and caption similarity to better predict human judgments, outperforming existing metrics.
Contribution
This work introduces MACE, a novel multimodal evaluation metric that integrates audio signals with caption similarity for improved audio captioning assessment.
Findings
MACE outperforms traditional metrics in predicting human quality judgments.
MACE achieves relative accuracy improvements of 3.28% and 4.36% over FENSE on the AudioCaps-Eval and Clotho-Eval datasets, respectively.
MACE significantly outperforms previous metrics on audio captioning datasets.
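The improvements over FENSE are stated as relative rather than absolute, and this summary does not define the term. A common reading is the accuracy gain divided by the baseline accuracy. The sketch below illustrates that reading only; the function name and the accuracy values are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline_acc: float, new_acc: float) -> float:
    """Return the gain of new_acc over baseline_acc as a percentage of baseline_acc."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (new_acc - baseline_acc) / baseline_acc * 100.0


# Hypothetical accuracies, chosen only to illustrate the arithmetic.
example_gain = relative_improvement(baseline_acc=80.0, new_acc=82.0)
print(f"{example_gain:.2f}% relative improvement")
```

Under this reading, a two-point absolute gain on an 80% baseline corresponds to a larger relative figure than the absolute difference, which is why the two should not be confused when comparing metrics.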
Abstract
The Automated Audio Captioning (AAC) task aims to describe an audio signal using natural language. To evaluate machine-generated captions, the metrics should take into account audio events, acoustic scenes, paralinguistics, signal characteristics, and other audio information. Traditional AAC evaluation relies on natural language generation metrics like ROUGE and BLEU, image captioning metrics such as SPICE and CIDEr, or Sentence-BERT embedding similarity. However, these metrics only compare generated captions to human references, overlooking the audio signal itself. In this work, we propose MACE (Multimodal Audio-Caption Evaluation), a novel metric that integrates both audio and reference captions for comprehensive audio caption evaluation. MACE incorporates information from the audio signal as well as from the predicted and reference captions, and weights the result with a fluency penalty. Our experiments…
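The abstract describes three ingredients: a comparison between the audio and the predicted caption, a comparison between the predicted and reference captions, and a fluency penalty. A minimal sketch of how such a score could be assembled follows. It is not the authors' implementation: the function names, the equal weighting, the penalty value, the use of cosine similarity, and the choice of the best-matching reference are all assumptions made for illustration, and the embeddings are assumed to come from some encoder that maps audio and text into a shared space.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mace_like_score(
    audio_emb: Sequence[float],
    candidate_emb: Sequence[float],
    reference_embs: Sequence[Sequence[float]],
    has_fluency_error: bool,
    audio_weight: float = 0.5,     # assumption: equal weighting of the two terms
    fluency_penalty: float = 0.9,  # assumption: illustrative penalty strength
) -> float:
    """Hypothetical audio-plus-text caption score with a fluency penalty."""
    if not reference_embs:
        raise ValueError("at least one reference embedding is required")
    # Audio-to-caption term: how well the candidate matches the audio itself.
    audio_text = cosine(audio_emb, candidate_emb)
    # Caption-to-caption term: best match among the human references (assumption).
    text_text = max(cosine(candidate_emb, ref) for ref in reference_embs)
    combined = audio_weight * audio_text + (1.0 - audio_weight) * text_text
    # Down-weight captions flagged as disfluent by some external detector.
    if has_fluency_error:
        combined *= 1.0 - fluency_penalty
    return combined


# Toy two-dimensional embeddings, used only to exercise the function.
score = mace_like_score([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]], False)
print(score)
```

The point of the audio term is that a caption can score well even when it differs in wording from every reference, as long as it matches what is actually in the recording; reference-only metrics cannot reward that case.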
Taxonomy
Topics: Music and Audio Processing · Subtitles and Audiovisual Media · Video Analysis and Summarization
