RECAP: Retrieval-Augmented Audio Captioning

Sreyan Ghosh; Sonal Kumar; Chandra Kiran Reddy Evuru; Ramani Duraiswami; Dinesh Manocha

arXiv:2309.09836 · eess.AS · June 7, 2024

TL;DR

RECAP is a retrieval-augmented audio captioning system that conditions caption generation on the input audio and on similar captions retrieved from a datastore. It performs strongly across domains without additional fine-tuning and can caption novel audio events.

Contribution

The paper introduces RECAP, a retrieval-augmented approach to audio captioning that transfers to new domains without additional fine-tuning and can caption unseen audio events and complex compositions.

Findings

01 Achieves competitive in-domain captioning performance.

02 Significantly improves out-of-domain captioning accuracy.

03 Demonstrates the ability to caption novel and compositional audio events.

Abstract

We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio and other captions similar to the audio retrieved from a datastore. Additionally, our proposed method can transfer to any domain without the need for any additional fine-tuning. To generate a caption for an audio sample, we leverage an audio-text model CLAP to retrieve captions similar to it from a replaceable datastore, which are then used to construct a prompt. Next, we feed this prompt to a GPT-2 decoder and introduce cross-attention layers between the CLAP encoder and GPT-2 to condition the audio for caption generation. Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP achieves competitive performance in in-domain settings and significant improvements in out-of-domain settings. Additionally, due to its…
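
The retrieval and prompt-construction steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the hand-written three-dimensional vectors stand in for CLAP embeddings, the function names `retrieve_captions` and `build_prompt` are hypothetical, and the prompt template is an assumption chosen for illustration. The GPT-2 decoding and cross-attention steps are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_captions(audio_emb, datastore, k=2):
    """Return the k captions whose embeddings are closest to audio_emb.

    datastore is a list of (caption, embedding) pairs. In a real system the
    embeddings would come from a shared audio-text space such as CLAP's;
    here they are toy vectors (an assumption for illustration).
    """
    ranked = sorted(
        datastore,
        key=lambda item: cosine(audio_emb, item[1]),
        reverse=True,
    )
    return [caption for caption, _ in ranked[:k]]


def build_prompt(captions):
    """Join retrieved captions into a text prompt (hypothetical template)."""
    return (
        "Similar audio sounds like: "
        + "; ".join(captions)
        + ". This audio sounds like:"
    )


# Toy datastore: captions paired with made-up embeddings.
datastore = [
    ("a dog barks in the distance", [0.9, 0.1, 0.0]),
    ("rain falls on a tin roof", [0.0, 0.2, 0.9]),
    ("a puppy whines and barks", [0.8, 0.3, 0.1]),
]

audio_emb = [1.0, 0.2, 0.0]  # stand-in for the embedding of the input audio
retrieved = retrieve_captions(audio_emb, datastore, k=2)
prompt = build_prompt(retrieved)
print(prompt)
```

Because the datastore is just an argument to `retrieve_captions`, swapping in captions from another domain changes what is retrieved without touching any model weights, which is the property the abstract refers to as a replaceable datastore.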

Code & Models

Repositories

sreyan88/recap (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Diverse Musicological Studies

Methods: Attention Is All You Need · Residual Connection · Adam · Linear Layer · Discriminative Fine-Tuning · Weight Decay · Multi-Head Attention · Dropout · Byte Pair Encoding