Text-based Audio Retrieval by Learning from Similarities between Audio Captions

Huang Xie; Khazar Khorrami; Okko Räsänen; Tuomas Virtanen

arXiv:2412.01356·eess.AS·December 3, 2024·IEEE Signal Process. Lett.

Open Access

TL;DR

This paper introduces a novel approach for text-based audio retrieval that leverages the textual similarities between audio captions to estimate non-binary relevance scores, improving retrieval performance.

Contribution

It proposes a method to compute non-binary audio-caption relevance scores using Sentence-BERT similarities and integrates them into training with a listwise ranking objective.
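
The summary above does not say which listwise objective is used, so the following is only a minimal sketch under an assumption: a ListNet-style loss that takes the cross-entropy between a softmax over the target relevance scores and a softmax over the model's predicted scores. The names `listwise_loss` and `temperature` are hypothetical and do not come from the paper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_loss(model_scores, relevance_scores, temperature=1.0):
    """ListNet-style cross-entropy between two score distributions.

    Assumption: this is an illustrative stand-in, not necessarily the
    objective used by the authors.
    model_scores: predicted audio-caption scores for one query.
    relevance_scores: non-binary target relevances for the same candidates.
    """
    target = softmax([r / temperature for r in relevance_scores])
    predicted = softmax(model_scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


# Toy values (hypothetical): three candidates for one query.
relevance = [0.9, 0.4, 0.1]
well_ordered = listwise_loss([2.0, 0.5, -1.0], relevance)
badly_ordered = listwise_loss([-1.0, 0.5, 2.0], relevance)
```

In this toy case the predictions that rank candidates in the same order as the target relevances yield the lower loss, which is the behaviour a listwise objective is meant to reward.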

Findings

01 Improved retrieval accuracy over binary relevance methods

02 Effective use of caption similarities for relevance estimation

03 Enhanced training with non-binary relevance scores

Abstract

This paper proposes to use similarities of audio captions for estimating audio-caption relevances to be used for training text-based audio retrieval systems. Current audio-caption datasets (e.g., Clotho) contain audio samples paired with annotated captions, but lack relevance information about audio samples and captions beyond the annotated ones. Besides, mainstream approaches (e.g., CLAP) usually treat the annotated pairs as positives and consider all other audio-caption combinations as negatives, assuming a binary relevance between audio samples and captions. To infer the relevance between audio samples and arbitrary captions, we propose a method that computes non-binary audio-caption relevance scores based on the textual similarities of audio captions. We measure textual similarities of audio captions by calculating the cosine similarity of their Sentence-BERT embeddings and then…
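
The similarity step named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy 3-dimensional vectors stand in for real Sentence-BERT embeddings, and the function names are hypothetical. Because the abstract is cut off before it explains how similarities become relevance scores, the sketch simply assumes that row `i` of the matrix describes the audio sample annotated with caption `i`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def caption_similarity_matrix(caption_embeddings):
    """Pairwise cosine similarities between caption embeddings.

    Assumption: entry [i][j] is read as a non-binary relevance of the
    audio sample annotated with caption i to caption j.
    """
    n = len(caption_embeddings)
    return [
        [cosine_similarity(caption_embeddings[i], caption_embeddings[j])
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy vectors; real Sentence-BERT embeddings have
# hundreds of dimensions and come from a pretrained encoder.
toy_embeddings = [
    [0.9, 0.1, 0.0],  # e.g. "a dog barks"
    [0.8, 0.2, 0.1],  # e.g. "a dog is barking loudly"
    [0.0, 0.1, 0.9],  # e.g. "rain falls on a roof"
]
scores = caption_similarity_matrix(toy_embeddings)
```

With these toy vectors the two dog captions score close to 1 against each other and close to 0 against the rain caption, so a caption that was never annotated for a clip can still receive a graded relevance instead of being treated as a plain negative.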

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Diverse Musicological Studies