Killing two birds with one stone: Can an audio captioning system also be used for audio-text retrieval?
Etienne Labbé (IRIT-SAMoVA), Thomas Pellegrini (IRIT-SAMoVA), Julien Pinquier (IRIT-SAMoVA)

TL;DR
This paper explores whether an audio captioning system can be repurposed for audio-text retrieval without additional training, showing promising results and potential for multi-task audio understanding.
Contribution
It demonstrates that an unmodified AAC system can effectively perform ATR tasks, achieving competitive results without fine-tuning or external data.
Findings
The AAC system achieves high captioning scores on Clotho and AudioCaps.
The system attains a Text-to-Audio R@1 of 0.382 on AudioCaps, surpassing some state-of-the-art methods.
Normalizing loss values improves Audio-to-Text retrieval performance.
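To illustrate why normalization could matter for Audio-to-Text but not for Text-to-Audio, here is a minimal sketch. It assumes the captioning loss of each (audio, caption) pair serves as the matching score, with lower meaning a better match. The per-caption min-max scaling, the function names, and the toy numbers are all illustrative assumptions, not the scheme or results reported in the paper.

```python
from typing import List

# loss[i][j]: captioning loss of caption j given audio clip i (lower = better match)
Matrix = List[List[float]]


def normalize_per_caption(loss: Matrix) -> Matrix:
    """Min-max scale each caption's column across all clips (assumed scheme)."""
    n_audio, n_text = len(loss), len(loss[0])
    out = [[0.0] * n_text for _ in range(n_audio)]
    for j in range(n_text):
        col = [loss[i][j] for i in range(n_audio)]
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # guard against a constant column
        for i in range(n_audio):
            out[i][j] = (loss[i][j] - lo) / span
    return out


def r_at_1_text_to_audio(loss: Matrix, audio_of_caption: List[int]) -> float:
    """Fraction of captions whose lowest-loss clip is the correct one."""
    hits = 0
    for j, target in enumerate(audio_of_caption):
        best = min(range(len(loss)), key=lambda i: loss[i][j])
        hits += int(best == target)
    return hits / len(audio_of_caption)


def r_at_1_audio_to_text(loss: Matrix, audio_of_caption: List[int]) -> float:
    """Fraction of clips whose lowest-loss caption belongs to that clip."""
    hits = 0
    for i, row in enumerate(loss):
        best = min(range(len(row)), key=lambda j: row[j])
        hits += int(audio_of_caption[best] == i)
    return hits / len(loss)


# Toy data (made up): caption 1 is short and generic, so its loss is low for every clip.
toy_loss = [[2.0, 1.0],
            [3.0, 0.5]]
toy_truth = [0, 1]  # caption j describes clip toy_truth[j]

raw_a2t = r_at_1_audio_to_text(toy_loss, toy_truth)
norm_a2t = r_at_1_audio_to_text(normalize_per_caption(toy_loss), toy_truth)
raw_t2a = r_at_1_text_to_audio(toy_loss, toy_truth)
norm_t2a = r_at_1_text_to_audio(normalize_per_caption(toy_loss), toy_truth)
```

In this toy case the generic caption wins for every clip until each caption's losses are rescaled across clips, so only the Audio-to-Text ranking changes. Text-to-Audio compares losses within a single caption's column, and a per-caption rescaling preserves that order.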
Abstract
Automated Audio Captioning (AAC) aims to develop systems capable of describing an audio recording using a textual sentence. In contrast, Audio-Text Retrieval (ATR) systems seek to find the best matching audio recording(s) for a given textual query (Text-to-Audio) or vice versa (Audio-to-Text). These tasks require different types of systems: AAC employs a sequence-to-sequence model, while ATR utilizes a ranking model that compares audio and text representations within a shared projection subspace. However, this work investigates the relationship between AAC and ATR by exploring the ATR capabilities of an unmodified AAC system, without fine-tuning for the new task. Our AAC system consists of an audio encoder (ConvNeXt-Tiny) trained on AudioSet for audio tagging, and a transformer decoder responsible for generating sentences. For AAC, it achieves a high SPIDEr-FL score of 0.298 on Clotho…
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Natural Language Processing Techniques
