Improving Audio Captioning Using Semantic Similarity Metrics
Rehana Mahfuz, Yinyi Guo, Erik Visser

TL;DR
This paper evaluates a semantic similarity metric for audio captioning, which compares predicted and reference captions by meaning rather than exact word overlap, and fine-tunes captioning models to optimize it directly, improving captions under both traditional metrics and the new metric.
Contribution
It proposes a novel semantic similarity metric for audio captioning and demonstrates its effectiveness for model evaluation and optimization.
Findings
The semantic similarity metric correlates well with caption relevance.
Fine-tuning to optimize the metric directly improves predicted captions, as measured by both traditional metrics and the semantic similarity metric.
The approach outperforms traditional metrics in evaluation.
Abstract
Audio captioning quality metrics, which are typically borrowed from the machine translation and image captioning areas, measure the degree of overlap between predicted tokens and gold reference tokens. In this work, we consider a metric measuring semantic similarities between predicted and reference captions instead of measuring exact word overlap. We first evaluate its ability to capture similarities among captions corresponding to the same audio file and compare it to other established metrics. We then propose a fine-tuning method to directly optimize the metric by backpropagating through a sentence embedding extractor and audio captioning network. Such fine-tuning results in an improvement in predicted captions as measured by both traditional metrics and the proposed semantic similarity captioning metric.
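The contrast between word overlap and semantic similarity can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional word vectors below are hand-picked for illustration, the averaging `embed` function is a hypothetical stand-in for a pretrained sentence embedding extractor, and Jaccard overlap is used only as a crude proxy for overlap-based metrics.

```python
import math

# Hypothetical 3-d word vectors, hand-picked for illustration only.
# A real system would use a pretrained sentence embedding model instead.
TOY_VECTORS = {
    "dog": (1.0, 0.1, 0.0),
    "puppy": (0.9, 0.2, 0.0),
    "barks": (0.1, 1.0, 0.0),
    "yelps": (0.2, 0.9, 0.0),
    "rain": (0.0, 0.0, 1.0),
    "falls": (0.0, 0.1, 0.9),
}


def embed(caption):
    """Average the toy vectors of known words (stand-in for a sentence encoder)."""
    vecs = [TOY_VECTORS[w] for w in caption.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return (0.0, 0.0, 0.0)
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def token_overlap(pred, ref):
    """Jaccard overlap of word sets, a crude proxy for overlap-based metrics."""
    p, r = set(pred.lower().split()), set(ref.lower().split())
    union = p | r
    return len(p & r) / len(union) if union else 0.0


def semantic_similarity(pred, ref):
    """Cosine similarity between the embeddings of two captions."""
    return cosine(embed(pred), embed(ref))


if __name__ == "__main__":
    pred, ref = "a puppy yelps", "a dog barks"
    print("token overlap:", token_overlap(pred, ref))
    print("semantic similarity:", semantic_similarity(pred, ref))
    print("unrelated caption:", semantic_similarity("rain falls", ref))
```

In this toy setup the paraphrased caption shares only one word with the reference, so its overlap score is low, while its embedding similarity is high. An unrelated caption scores low on both.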
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization · Multimodal Machine Learning Applications
