Text-to-Audio Grounding Based Novel Metric for Evaluating Audio Caption Similarity
Swapnil Bhosale, Rupayan Chakraborty, Sunil Kumar Kopparapu

TL;DR
This paper introduces a novel Text-to-Audio Grounding (TAG) metric for evaluating audio captioning, which captures the sound properties expressed in text better than traditional lexical metrics.
Contribution
The paper proposes a new evaluation metric based on Text-to-Audio Grounding that improves assessment accuracy for Audio Captioning tasks.
Findings
TAG metric outperforms existing evaluation metrics
Better captures perceived sound properties in captions
Enhances evaluation of cross-modal audio-text tasks
Abstract
Automatic Audio Captioning (AAC) refers to the task of translating an audio sample into a natural language (NL) text that describes the audio events, the sources of the events, and their relationships. Unlike NL text generation tasks, which rely on metrics such as BLEU, ROUGE, and METEOR that are based on lexical semantics, an AAC evaluation metric requires the ability to map NL text (phrases) that correspond to similar sounds, in addition to lexical semantics. Current metrics used for evaluating AAC tasks lack an understanding of the perceived properties of sound represented by text. In this paper, we propose a novel metric based on Text-to-Audio Grounding (TAG), which is useful for evaluating cross-modal tasks like AAC. Experiments on a publicly available AAC dataset show that our evaluation metric performs better than existing metrics used in the NL text and image captioning literature.
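The limitation of lexical metrics described in the abstract can be illustrated with a minimal sketch. The code below is not the TAG metric and is not taken from the paper; it computes a simple clipped unigram precision (a simplified, BLEU-1-style overlap without a brevity penalty). The function name `unigram_precision` and all three captions are hypothetical and chosen only for illustration.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: the fraction of candidate tokens that
    also appear in the reference. A simplified lexical-overlap score,
    NOT the TAG metric proposed in the paper."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Clip each candidate word count by its count in the reference.
    overlap = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    return overlap / len(cand_tokens)


# Hypothetical captions, invented for illustration only.
reference = "a dog is barking in the distance"
same_sound = "canine woofing far away"               # similar sound, no shared words
different_sound = "a man is shouting in the distance"  # different sound, shared words

print(unigram_precision(same_sound, reference))
print(unigram_precision(different_sound, reference))
```

Under these assumptions, the caption describing a similar sound scores zero because it shares no words with the reference, while the caption describing a different sound scores much higher because it shares most of its wording. This is the kind of mismatch that a sound-aware metric is meant to address.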
Taxonomy
Topics: Music and Audio Processing · Subtitles and Audiovisual Media · Video Analysis and Summarization
