Advancing Natural-Language Based Audio Retrieval with PaSST and Large Audio-Caption Data Sets
Paul Primus, Khaled Koutini, Gerhard Widmer

TL;DR
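This paper introduces a text-to-audio retrieval system built on pre-trained text and spectrogram transformers that project recordings and captions into a shared audio-caption space. The system ranked first in the 2023 DCASE Challenge, and a systematic analysis identifies the audio encoder and the additional pre-training data as the main drivers of performance.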
Contribution
The work combines pre-trained text and spectrogram transformers with additional human-generated and synthetic audio-caption data sets for text-to-audio retrieval, and systematically analyzes how each system component affects retrieval performance.
Findings
Ranked first in the 2023 DCASE Challenge.
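Outperforms the previous state of the art on ClothoV2 by 5.6 percentage points in mAP@10.
Identifies the self-attention-based audio encoder and the additional human-generated and synthetic pre-training data sets as the two key drivers of performance.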
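The mAP@10 figure above is a ranking metric. As a rough illustration of how such a score can be computed, here is a minimal sketch in plain Python. The function names and the toy data are hypothetical, and the normalization by min(k, number of relevant items) is one common convention that is assumed here; the paper's exact evaluation code is not shown on this page.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """AP@k for one query: sum of precision values at the ranks where a
    relevant item appears, divided by min(k, number of relevant items).
    (The normalization is an assumed convention, not taken from the paper.)"""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(rankings, relevance, k=10):
    """mAP@k: the mean of AP@k over all queries."""
    scores = [average_precision_at_k(r, rel, k)
              for r, rel in zip(rankings, relevance)]
    return sum(scores) / len(scores)


# Toy example with two hypothetical text queries.
rankings = [["a", "b", "c"], ["x", "y", "z"]]   # retrieved recordings, best first
relevance = [["b"], ["q"]]                      # ground-truth recordings per query
score = mean_average_precision_at_k(rankings, relevance, k=10)
```

In the toy example the first query finds its recording at rank 2 (AP = 0.5) and the second query misses entirely (AP = 0), so the mean is 0.25. A gap of 5.6 percentage points is an absolute difference on this scale, for instance 0.056 when scores are expressed as fractions.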
Abstract
This work presents a text-to-audio-retrieval system based on pre-trained text and spectrogram transformers. Our method projects recordings and textual descriptions into a shared audio-caption space in which related examples from different modalities are close. Through a systematic analysis, we examine how each component of the system influences retrieval performance. As a result, we identify two key components that play a crucial role in driving performance: the self-attention-based audio encoder for audio embedding and the utilization of additional human-generated and synthetic data sets during pre-training. We further experimented with augmenting ClothoV2 captions with available keywords to increase their variety; however, this only led to marginal improvements. Our system ranked first in the 2023 DCASE Challenge, and it outperforms the current state of the art on the ClothoV2…
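To make the idea of a shared audio-caption space concrete, the following is a minimal sketch of the retrieval step only. It assumes the embeddings have already been produced by some text and audio encoders; the two-dimensional vectors, the recording names, and the choice of cosine similarity as the closeness measure are illustrative assumptions, not details confirmed by the abstract.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity: dot product of the two unit-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_recordings(text_embedding, audio_embeddings):
    """Return recording names sorted from most to least similar to the query."""
    scores = {name: cosine_similarity(text_embedding, emb)
              for name, emb in audio_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical embeddings standing in for encoder outputs.
audio_embeddings = {
    "dog_bark": [1.0, 0.0],
    "rain": [0.0, 1.0],
    "thunder": [0.6, 0.8],
}
query_embedding = [0.1, 0.9]  # stand-in for an embedded caption about rain
ranking = rank_recordings(query_embedding, audio_embeddings)
```

With these toy vectors the query lies closest to "rain", then "thunder", then "dog_bark". In a real system the quality of this ranking depends entirely on how well the trained encoders place matching recordings and captions near each other.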
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies · Speech Recognition and Synthesis
