Advancing Natural-Language Based Audio Retrieval with PaSST and Large Audio-Caption Data Sets

Paul Primus; Khaled Koutini; Gerhard Widmer

arXiv:2308.04258 · eess.AS · August 9, 2023

TL;DR

This paper introduces a text-to-audio retrieval system that uses pre-trained text and spectrogram transformers to project recordings and captions into a shared embedding space. The system ranked first in the 2023 DCASE Challenge, and a systematic analysis identifies which components drive its retrieval performance.

Contribution

The work presents a novel approach combining pre-trained transformers and large datasets for improved text-to-audio retrieval, with systematic analysis of system components.

Findings

1. Ranked first in the 2023 DCASE Challenge.
2. Outperforms the state of the art on ClothoV2 by 5.6 percentage points mAP@10.
3. Identifies the self-attention-based audio encoder and the additional human-generated and synthetic pre-training data sets as the two key drivers of performance.
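
The mAP@10 metric quoted in the findings can be illustrated with a minimal sketch. The function names and the toy rankings below are hypothetical and are not taken from the paper or its repository. The sketch uses the common definition of average precision truncated at rank k, and the toy data assumes a single relevant recording per caption, in which case the score reduces to the reciprocal rank of that recording if it appears in the top k, and zero otherwise.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Average precision over the top-k entries of one ranked result list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant item occurs.
            precision_sum += hits / rank
    # Normalize by the best achievable number of hits within the top k.
    return precision_sum / min(len(relevant), k)


def mean_average_precision_at_k(rankings, relevance, k=10):
    """Mean of the per-query average precision values."""
    scores = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevance)]
    return sum(scores) / len(scores)


# Toy example (hypothetical data): three caption queries, each with one
# relevant recording "a", found at rank 1, 2, and 3 respectively.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
relevance = [["a"], ["a"], ["a"]]
map_at_10 = mean_average_precision_at_k(rankings, relevance, k=10)
```

A gain of 5.6 percentage points is an absolute difference on this scale, for example from 0.300 to 0.356, not a relative improvement of 5.6 percent.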

Abstract

This work presents a text-to-audio-retrieval system based on pre-trained text and spectrogram transformers. Our method projects recordings and textual descriptions into a shared audio-caption space in which related examples from different modalities are close. Through a systematic analysis, we examine how each component of the system influences retrieval performance. As a result, we identify two key components that play a crucial role in driving performance: the self-attention-based audio encoder for audio embedding and the utilization of additional human-generated and synthetic data sets during pre-training. We further experimented with augmenting ClothoV2 captions with available keywords to increase their variety; however, this only led to marginal improvements. Our system ranked first in the 2023 DCASE Challenge, and it outperforms the current state of the art on the ClothoV2…
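
The retrieval step implied by a shared audio-caption space can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the embeddings are hand-written toy vectors standing in for the outputs of the audio and text encoders, the file names and function names are hypothetical, and cosine similarity is assumed as the closeness measure because the abstract says only that related examples are "close".

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity, assumed here as the measure of closeness."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_recordings(query_embedding, audio_embeddings, top_k=10):
    """Return the top_k (name, score) pairs, most similar recording first."""
    scored = [
        (name, cosine_similarity(query_embedding, emb))
        for name, emb in audio_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical 3-dimensional embeddings; real encoder outputs would have
# far more dimensions.
audio_embeddings = {
    "dog_bark.wav": [0.9, 0.1, 0.0],
    "rain.wav": [0.0, 0.2, 0.95],
    "traffic.wav": [0.3, 0.9, 0.1],
}
caption_embedding = [0.1, 0.1, 0.9]  # stands in for an embedded text query

results = rank_recordings(caption_embedding, audio_embeddings, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Because the audio embeddings do not depend on the query, they can be computed once for the whole collection, and each new caption then needs only one text-encoder pass plus a similarity search.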

Code & Models

Repositories

optimusprimus/dcase2023_task6b
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Diverse Musicological Studies · Speech Recognition and Synthesis