MACE: Leveraging Audio for Evaluating Audio Captioning Systems

Satvik Dixit; Soham Deshmukh; Bhiksha Raj

arXiv:2411.00321·cs.SD·November 6, 2024

TL;DR

MACE is a new evaluation metric for audio captioning that combines audio features and caption similarity to better predict human judgments, outperforming existing metrics.

Contribution

This work introduces MACE, a novel multimodal evaluation metric that integrates audio signals with caption similarity for improved audio captioning assessment.

Findings

01

MACE outperforms traditional metrics in predicting human quality judgments.

02

MACE achieves relative accuracy improvements of 3.28% and 4.36% over the FENSE metric on the AudioCaps-Eval and Clotho-Eval datasets, respectively.

03

MACE significantly outperforms previous metrics on audio captioning datasets.
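
The "relative accuracy improvement" in finding 02 is a ratio against the baseline's accuracy, not a difference in percentage points. A minimal sketch of that arithmetic follows; the accuracy values used are hypothetical placeholders chosen for illustration and are not results from the paper.

```python
def relative_improvement(baseline_acc: float, new_acc: float) -> float:
    """Return the relative improvement of new_acc over baseline_acc, in percent."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (new_acc - baseline_acc) / baseline_acc * 100.0


if __name__ == "__main__":
    # Hypothetical accuracies (in percent), NOT taken from the paper.
    baseline = 80.0
    candidate = 82.0
    print(f"absolute gain: {candidate - baseline:.2f} points")
    print(f"relative gain: {relative_improvement(baseline, candidate):.2f}%")
```

With these placeholder values, a 2-point absolute gain corresponds to a 2.5% relative gain, which is why relative figures should always be read alongside the baseline they are measured against.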

Abstract

The Automated Audio Captioning (AAC) task aims to describe an audio signal using natural language. To evaluate machine-generated captions, the metrics should take into account audio events, acoustic scenes, paralinguistics, signal characteristics, and other audio information. Traditional AAC evaluation relies on natural language generation metrics like ROUGE and BLEU, image captioning metrics such as SPICE and CIDEr, or Sentence-BERT embedding similarity. However, these metrics only compare generated captions to human references, overlooking the audio signal itself. In this work, we propose MACE (Multimodal Audio-Caption Evaluation), a novel metric that integrates both audio and reference captions for comprehensive audio caption evaluation. MACE draws on the audio signal itself as well as the predicted and reference captions, and weights the resulting score with a fluency penalty. Our experiments…
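
The abstract describes the metric only at a high level: a score that uses the audio, the predicted caption, and the reference captions, weighted by a fluency penalty. The sketch below illustrates one way such a score could be assembled. It is not the authors' formula. The embeddings are assumed to come from some joint audio-text embedding model (the abstract does not name one), and the function name `multimodal_caption_score`, the mixing weight `alpha`, the `error_threshold`, the `penalty` factor, and the max-over-references aggregation are all hypothetical choices made for illustration.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def multimodal_caption_score(
    candidate_emb: Sequence[float],
    audio_emb: Sequence[float],
    reference_embs: Sequence[Sequence[float]],
    fluency_error_prob: float,
    alpha: float = 0.5,            # hypothetical weight between the two similarities
    error_threshold: float = 0.5,  # hypothetical cutoff for "caption is disfluent"
    penalty: float = 0.9,          # hypothetical fraction of the score removed
) -> float:
    """Illustrative score: mix audio-caption and caption-reference similarity,
    then scale the result down when the caption is judged disfluent."""
    # How well the candidate caption matches the audio itself.
    audio_sim = cosine(candidate_emb, audio_emb)
    # How well the candidate matches its closest human reference (assumed aggregation).
    text_sim = max(cosine(candidate_emb, ref) for ref in reference_embs)
    score = alpha * audio_sim + (1.0 - alpha) * text_sim
    # Fluency penalty: shrink the score if an external detector flags the caption.
    if fluency_error_prob > error_threshold:
        score *= 1.0 - penalty
    return score


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    candidate = [1.0, 0.0]
    audio = [1.0, 0.0]
    references = [[0.0, 1.0], [1.0, 0.0]]
    print(multimodal_caption_score(candidate, audio, references, fluency_error_prob=0.1))
    print(multimodal_caption_score(candidate, audio, references, fluency_error_prob=0.9))
```

The point of the audio term is that a caption can score well even when it departs from the human references, provided it still matches the audio; a reference-only metric cannot reward that case.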

Code & Models

Repositories

satvik-dixit/mace (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Subtitles and Audiovisual Media · Video Analysis and Summarization