Text-to-Audio Grounding Based Novel Metric for Evaluating Audio Caption Similarity

Swapnil Bhosale; Rupayan Chakraborty; Sunil Kumar Kopparapu

arXiv:2210.06354·cs.CL·October 13, 2022

Open Access

TL;DR

This paper introduces a novel Text-to-Audio Grounding (TAG) metric for evaluating audio captioning, which better captures sound properties in text compared to traditional lexical-based metrics.

Contribution

The paper proposes a new evaluation metric based on Text-to-Audio Grounding that improves assessment accuracy for Audio Captioning tasks.

Findings

01 TAG metric outperforms existing evaluation metrics

02 Better captures perceived sound properties in captions

03 Enhances evaluation of cross-modal audio-text tasks

Abstract

Automatic Audio Captioning (AAC) refers to the task of translating an audio sample into a natural language (NL) text that describes the audio events, the sources of the events, and their relationships. NL text generation tasks rely on metrics such as BLEU, ROUGE, and METEOR, which are based on lexical semantics. An AAC evaluation metric, in contrast, also requires the ability to map NL text (phrases) that correspond to similar sounds, in addition to lexical semantics. Current metrics used for evaluating AAC tasks lack an understanding of the perceived properties of sound represented by text. In this paper, we propose a novel metric based on Text-to-Audio Grounding (TAG), which is useful for evaluating cross-modal tasks like AAC. Experiments on a publicly available AAC dataset show that our evaluation metric performs better than existing metrics used in the NL text and image captioning literature.
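The limitation of purely lexical metrics described in the abstract can be illustrated with a small sketch. The code below is not the TAG metric and is not taken from the paper; it is a simplified clipped unigram precision (the core of BLEU-1, without the brevity penalty) applied to three hypothetical captions invented for this illustration. The function name `unigram_precision` and all caption strings are assumptions.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: a simplified BLEU-1 core, no brevity penalty."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate token is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return overlap / len(cand_tokens)


# Hypothetical captions, made up for illustration only.
reference = "a dog barks while rain falls"
similar_sound = "a canine woofs during a downpour"   # similar sounds, different words
different_sound = "a dog sleeps while snow falls"    # shared words, different sounds

score_similar = unigram_precision(similar_sound, reference)
score_different = unigram_precision(different_sound, reference)

print(f"similar sound, different words: {score_similar:.3f}")
print(f"different sound, shared words:  {score_different:.3f}")
```

In this toy case the word-overlap score ranks the caption that describes different sounds above the one that describes similar sounds, which is the kind of mismatch a metric grounded in audio is meant to address.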

Taxonomy

Topics: Music and Audio Processing · Subtitles and Audiovisual Media · Video Analysis and Summarization