Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval

Paul Primus; Florian Schmid; Gerhard Widmer

arXiv:2408.11641 · eess.AS · August 22, 2024

TL;DR

This paper introduces a two-stage training method for language-based audio retrieval. Instead of assuming that randomly paired recordings and captions never match, it estimates audio-caption correspondences and trains on those estimates, which improves retrieval accuracy.

Contribution

The authors propose a two-stage training procedure that substitutes estimated audio-caption correspondences for the pairwise annotations that are typically unavailable, outperforming the previous state of the art on ClothoV2 by 1.6 percentage points in mAP@10.

Findings

01 Improved retrieval performance on the ClothoV2 and AudioCaps benchmarks.

02 Outperforms the state of the art by 1.6 percentage points in mAP@10 on ClothoV2.

03 Effective even when a single model both generates and learns from the estimated correspondences.
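The mAP@10 metric named in the findings can be sketched as follows. This is a generic definition of mean average precision truncated at rank 10, not the paper's evaluation code: the function names are hypothetical, and normalizing by min(number of relevant items, k) is one common convention that is assumed here.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Average precision over the top-k retrieved items for one query."""
    hits = 0
    score = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant_ids:
            hits += 1
            score += hits / rank  # precision at this rank
    # Assumed convention: normalize by the best achievable number of hits.
    denom = min(len(relevant_ids), k)
    return score / denom if denom else 0.0


def map_at_k(rankings, relevance, k=10):
    """Mean of the per-query average precision values."""
    per_query = [
        average_precision_at_k(ranked, relevant, k)
        for ranked, relevant in zip(rankings, relevance)
    ]
    return sum(per_query) / len(per_query)
```

When each caption has exactly one relevant recording, the per-query score reduces to the reciprocal of that recording's rank if it appears in the top 10, and to zero otherwise.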

Abstract

Dual-encoder-based audio retrieval systems are commonly optimized with contrastive learning on a set of matching and mismatching audio-caption pairs. This leads to a shared embedding space in which corresponding items from the two modalities end up close together. Since audio-caption datasets typically only contain matching pairs of recordings and descriptions, it has become common practice to create mismatching pairs by pairing the audio with a caption randomly drawn from the dataset. This is not ideal because the randomly sampled caption could, just by chance, partly or entirely describe the audio recording. However, correspondence information for all possible pairs is costly to annotate and thus typically unavailable; we, therefore, suggest substituting it with estimated correspondences. To this end, we propose a two-staged training procedure in which multiple retrieval models are…
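The contrastive objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the temperature of 0.05, and the way soft targets enter the loss are assumptions made for illustration. With the default identity targets, every off-diagonal pair in a batch is treated as a mismatch, which is the common practice the abstract questions. Passing a matrix of estimated correspondences instead is one plausible way to relax that assumption.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def directional_loss(logits, targets):
    """Cross-entropy between row-normalized targets and the row-wise softmax."""
    total = 0.0
    for logit_row, target_row in zip(logits, targets):
        z = sum(target_row)
        log_p = log_softmax(logit_row)
        total -= sum(t / z * lp for t, lp in zip(target_row, log_p))
    return total / len(logits)


def contrastive_loss(audio_emb, text_emb, targets=None, temperature=0.05):
    """Symmetric contrastive loss over a batch of audio and caption embeddings.

    targets[i][j] is the (possibly estimated) correspondence between
    recording i and caption j. The default identity matrix assumes that
    only the annotated pairs match.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    # Cosine similarity of every recording with every caption, scaled.
    logits = [
        [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        for ai in a
    ]
    if targets is None:
        n = len(a)
        targets = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    audio_to_text = directional_loss(logits, targets)
    text_to_audio = directional_loss(transpose(logits), transpose(targets))
    return 0.5 * (audio_to_text + text_to_audio)
```

In this sketch, a nonzero off-diagonal entry in `targets` stops the loss from pushing that recording and caption fully apart, which is the intuition behind replacing the "random pairs never match" assumption with estimates.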

Code & Models

Repositories

optimusprimus/salsa
pytorch · Official

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Diverse Musicological Studies

Methods: Sparse Evolutionary Training · Contrastive Learning