Killing two birds with one stone: Can an audio captioning system also be used for audio-text retrieval?

Etienne Labbé (IRIT-SAMoVA); Thomas Pellegrini (IRIT-SAMoVA); Julien Pinquier (IRIT-SAMoVA)

arXiv:2308.15090·cs.CL·August 30, 2023

Open Access

TL;DR

This paper explores whether an audio captioning system can be repurposed for audio-text retrieval without additional training, showing promising results and potential for multi-task audio understanding.

Contribution

The paper demonstrates that an unmodified automated audio captioning (AAC) system can perform audio-text retrieval (ATR), achieving competitive results without fine-tuning or external data.

Findings

01. The AAC system achieves high captioning scores on Clotho and AudioCaps.

02. The system attains a Text-to-Audio R@1 of 0.382 on AudioCaps, surpassing some state-of-the-art methods.

03. Normalizing loss values improves Audio-to-Text retrieval performance.
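The findings above mention loss values and R@1 without spelling out how they relate. The sketch below is one plausible reading, not the authors' implementation: it assumes the captioning loss of each caption given each audio is used as a matching score (lower is better), and that "normalizing" means a per-caption min-max rescaling. The function names, the normalization scheme, and the toy loss matrix are all illustrative assumptions.

```python
from typing import List

# loss[i][j]: assumed captioning loss of caption j given audio i (lower = better match)
Matrix = List[List[float]]


def minmax_normalize_columns(loss: Matrix) -> Matrix:
    """Rescale each caption's column to [0, 1] (an assumed normalization scheme)."""
    n_rows, n_cols = len(loss), len(loss[0])
    out = [row[:] for row in loss]
    for j in range(n_cols):
        col = [loss[i][j] for i in range(n_rows)]
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # guard against a constant column
        for i in range(n_rows):
            out[i][j] = (loss[i][j] - lo) / span
    return out


def text_to_audio_ranks(loss: Matrix) -> List[List[int]]:
    """For each caption, audio indices sorted from lowest to highest loss."""
    n_rows, n_cols = len(loss), len(loss[0])
    return [sorted(range(n_rows), key=lambda i: loss[i][j]) for j in range(n_cols)]


def audio_to_text_ranks(loss: Matrix) -> List[List[int]]:
    """For each audio, caption indices sorted from lowest to highest loss."""
    n_cols = len(loss[0])
    return [sorted(range(n_cols), key=lambda j: row[j]) for row in loss]


def recall_at_k(ranks: List[List[int]], targets: List[int], k: int) -> float:
    """Fraction of queries whose target appears among the top-k ranked items."""
    hits = sum(1 for r, t in zip(ranks, targets) if t in r[:k])
    return hits / len(targets)


# Toy data (made up): caption j describes audio j; caption 2 is "generic"
# and receives a low loss for every audio.
toy_loss = [
    [1.0, 3.0, 0.9],
    [3.0, 1.2, 1.0],
    [3.5, 3.2, 0.5],
]
targets = [0, 1, 2]

t2a_r1 = recall_at_k(text_to_audio_ranks(toy_loss), targets, k=1)
a2t_raw_r1 = recall_at_k(audio_to_text_ranks(toy_loss), targets, k=1)
a2t_norm_r1 = recall_at_k(
    audio_to_text_ranks(minmax_normalize_columns(toy_loss)), targets, k=1
)
print(t2a_r1, a2t_raw_r1, a2t_norm_r1)
```

In this toy matrix, Text-to-Audio ranking compares losses within a single caption's column and is unaffected by the generic caption, whereas raw Audio-to-Text ranking picks the generic caption for two of the three audios (R@1 of 1/3). Rescaling each caption's column brings R@1 to 1.0, which illustrates why a normalization step could matter for the Audio-to-Text direction.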

Abstract

Automated Audio Captioning (AAC) aims to develop systems capable of describing an audio recording using a textual sentence. In contrast, Audio-Text Retrieval (ATR) systems seek to find the best matching audio recording(s) for a given textual query (Text-to-Audio) or vice versa (Audio-to-Text). These tasks require different types of systems: AAC employs a sequence-to-sequence model, while ATR utilizes a ranking model that compares audio and text representations within a shared projection subspace. However, this work investigates the relationship between AAC and ATR by exploring the ATR capabilities of an unmodified AAC system, without fine-tuning for the new task. Our AAC system consists of an audio encoder (ConvNeXt-Tiny) trained on AudioSet for audio tagging, and a transformer decoder responsible for generating sentences. For AAC, it achieves a high SPIDEr-FL score of 0.298 on Clotho…

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Natural Language Processing Techniques