RECAP: Retrieval-Augmented Audio Captioning
Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Ramani Duraiswami, Dinesh Manocha

TL;DR
RECAP is a retrieval-augmented audio captioning system that leverages retrieved similar captions to generate accurate descriptions, demonstrating strong performance across domains without additional fine-tuning and enabling novel audio event captioning.
Contribution
The paper introduces RECAP, a retrieval-augmented approach to audio captioning that transfers to new domains without additional fine-tuning and can caption unseen audio events and complex compositions.
Findings
Achieves competitive in-domain captioning performance.
Significantly improves out-of-domain captioning accuracy.
Demonstrates ability to caption novel and compositional audio events.
Abstract
We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio and other captions similar to the audio retrieved from a datastore. Additionally, our proposed method can transfer to any domain without the need for any additional fine-tuning. To generate a caption for an audio sample, we leverage an audio-text model CLAP to retrieve captions similar to it from a replaceable datastore, which are then used to construct a prompt. Next, we feed this prompt to a GPT-2 decoder and introduce cross-attention layers between the CLAP encoder and GPT-2 to condition the audio for caption generation. Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP achieves competitive performance in in-domain settings and significant improvements in out-of-domain settings. Additionally, due to its…
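The retrieve-then-prompt step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three-dimensional vectors stand in for CLAP audio and text embeddings, the datastore contents are made up, and the prompt template and the function names `retrieve_captions` and `build_prompt` are hypothetical. The cross-attention conditioning of GPT-2 on the audio encoding is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_captions(audio_emb, datastore, k=2):
    """Return the k captions whose text embeddings are closest to the audio embedding.

    datastore is a list of (caption, text_embedding) pairs. Because it is just
    data, it can be swapped for another domain without retraining anything.
    """
    ranked = sorted(datastore, key=lambda item: cosine(audio_emb, item[1]), reverse=True)
    return [caption for caption, _ in ranked[:k]]

def build_prompt(captions):
    """Assemble a text prompt for the decoder (hypothetical template)."""
    lines = ["Similar audio clips are described as:"]
    lines += [f"- {c}" for c in captions]
    lines.append("This audio sounds like:")
    return "\n".join(lines)

# Toy stand-ins for embeddings in a shared audio-text space (assumed values).
datastore = [
    ("a dog barks in the distance", [0.9, 0.1, 0.0]),
    ("rain falls on a tin roof", [0.0, 1.0, 0.1]),
    ("a puppy whines and barks", [0.8, 0.2, 0.1]),
    ("a car engine idles", [0.1, 0.0, 1.0]),
]
audio_emb = [1.0, 0.1, 0.0]  # pretend embedding of a barking-dog clip

retrieved = retrieve_captions(audio_emb, datastore, k=2)
prompt = build_prompt(retrieved)
print(prompt)
```

Keeping the datastore as plain data separate from the model is what would make it replaceable: pointing `retrieve_captions` at captions from a different domain changes the prompt without touching any weights.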
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Diverse Musicological Studies
Methods: Attention Is All You Need · Residual Connection · Adam · Linear Layer · Discriminative Fine-Tuning · Weight Decay · Multi-Head Attention · Dropout · Byte Pair Encoding
