DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning
Xiquan Li, Wenxi Chen, Ziyang Ma, Xuenan Xu, Yuzhe Liang, Zhisheng Zheng, Qiuqiang Kong, Xie Chen

TL;DR
DRCap is a zero-shot audio captioning system built on CLAP and an LLM. It trains on text-only data and uses retrieval and projection strategies to caption audio without paired audio-text training data.
Contribution
It introduces a novel retrieval-augmented generation approach combining CLAP and LLMs for flexible, domain-adaptive zero-shot audio captioning requiring only text data for training.
Findings
Outperforms existing zero-shot models in in-domain scenarios.
Achieves state-of-the-art results in cross-domain audio captioning.
Demonstrates effective domain adaptation without additional fine-tuning.
Abstract
While automated audio captioning (AAC) has made notable progress, traditional fully supervised AAC models still face two critical challenges: the need for expensive audio-text pair data for training and performance degradation when transferring across domains. To overcome these limitations, we present DRCap, a data-efficient and flexible zero-shot audio captioning system that requires text-only data for training and can quickly adapt to new domains without additional fine-tuning. DRCap integrates a contrastive language-audio pre-training (CLAP) model and a large-language model (LLM) as its backbone. During training, the model predicts the ground-truth caption with a fixed text encoder from CLAP, whereas, during inference, the text encoder is replaced with the audio encoder to generate captions for audio clips in a zero-shot manner. To mitigate the modality gap of the CLAP model, we use…
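The encoder swap described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the hand-picked 3-dimensional vectors stand in for CLAP embeddings, `encode_text` and `encode_audio` are hypothetical placeholders for the two CLAP encoders, and the cosine-similarity lookup is one generic way to retrieve captions from a text-only datastore, not necessarily the retrieval scheme DRCap uses. The point is only that the training path and the inference path feed the same kind of input to the decoder, differing in which encoder produces the latent.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Hypothetical text-only datastore: caption -> toy embedding in a shared space.
DATASTORE = {
    "a dog barks": [0.9, 0.1, 0.0],
    "rain falls on a roof": [0.0, 0.9, 0.1],
    "a car engine starts": [0.2, 0.0, 0.9],
}

def encode_text(caption):
    # Placeholder for the frozen CLAP text encoder (used during training).
    return normalize(DATASTORE[caption])

def encode_audio(clip_features):
    # Placeholder for the CLAP audio encoder (used during inference).
    return normalize(clip_features)

def retrieve(query, k=1, exclude=None):
    """Return the k captions whose embeddings are most similar to the query."""
    candidates = [c for c in DATASTORE if c != exclude]
    candidates.sort(key=lambda c: cosine(query, DATASTORE[c]), reverse=True)
    return candidates[:k]

def decoder_input(latent, retrieved):
    # In a real system this would be passed to the LLM; here it is just a dict.
    return {"latent": latent, "retrieved_captions": retrieved}

def training_step_input(caption):
    # Training path: latent comes from the text encoder. Excluding the target
    # caption from retrieval is an assumption made to avoid trivial copying.
    latent = encode_text(caption)
    return decoder_input(latent, retrieve(latent, exclude=caption))

def inference_input(clip_features):
    # Inference path: same pipeline, but the latent comes from the audio encoder.
    latent = encode_audio(clip_features)
    return decoder_input(latent, retrieve(latent))

train_example = training_step_input("a dog barks")
infer_example = inference_input([0.8, 0.2, 0.1])
```

Because both paths return the same structure, a decoder trained on `training_step_input` outputs can consume `inference_input` outputs unchanged, provided the two encoders place matching audio and text close together in the shared space.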
Taxonomy
Topics: Natural Language Processing Techniques · Music and Audio Processing · Subtitles and Audiovisual Media
