TL;DR
This paper introduces three new benchmarks for cross-modal text-audio and audio-text retrieval tasks, enabling more effective search of audio content using natural language descriptions, and establishes baseline results demonstrating the benefits of pre-training.
Contribution
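The paper presents three challenging new benchmarks for text-audio and audio-text retrieval: two built from the existing AudioCaps and Clotho audio captioning datasets, and one from the newly introduced SoundDescs dataset. Together they give future research in this area a common footing.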
Findings
Pre-training on diverse audio tasks improves retrieval performance.
The benchmarks enable standardized evaluation of text-audio retrieval methods.
Baseline results demonstrate the effectiveness of the proposed datasets.
Abstract
The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve the audio content from a pool of candidates that best matches a given written description and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like to hear. To study the tasks of text-audio and audio-text retrieval, which have received limited attention in the existing literature, we introduce three challenging new benchmarks. We first construct text-audio and audio-text retrieval benchmarks from the AudioCaps and Clotho audio captioning datasets. Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in…
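The retrieval task described in the abstract can be illustrated with a small sketch. It assumes that text and audio have already been mapped into a shared embedding space, ranks candidates by cosine similarity, and scores the ranking with Recall@K, a metric commonly used for retrieval. The embedding values, the function names, and the choice of cosine similarity and Recall@K are illustrative assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(query, candidates):
    """Return candidate indices ordered from best to worst match."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

def recall_at_k(queries, candidates, targets, k):
    """Fraction of queries whose correct candidate appears in the top k."""
    hits = 0
    for query, target in zip(queries, targets):
        if target in rank_candidates(query, candidates)[:k]:
            hits += 1
    return hits / len(queries)

# Hypothetical toy embeddings: text_embeddings[i] describes audio_embeddings[i].
text_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_embeddings = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]
targets = [0, 1, 2]

# Text-audio retrieval: each description queries the pool of audio clips.
r1 = recall_at_k(text_embeddings, audio_embeddings, targets, k=1)
r3 = recall_at_k(text_embeddings, audio_embeddings, targets, k=3)
print(f"R@1 = {r1:.2f}, R@3 = {r3:.2f}")
```

Audio-text retrieval uses the same functions with the roles swapped: the audio embeddings become the queries and the text embeddings become the candidate pool.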
