DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning

Xiquan Li; Wenxi Chen; Ziyang Ma; Xuenan Xu; Yuzhe Liang; Zhisheng Zheng; Qiuqiang Kong; Xie Chen

arXiv:2410.09472 · cs.SD · January 7, 2025

TL;DR

DRCap is a zero-shot audio captioning system built on CLAP and an LLM. It trains on text-only data and uses retrieval and projection strategies to caption audio without paired audio-text training data.

Contribution

DRCap introduces a retrieval-augmented generation approach that combines CLAP and an LLM for flexible, domain-adaptive zero-shot audio captioning, requiring only text data for training.
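The retrieval step behind a retrieval-augmented captioner can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: it assumes a datastore of caption embeddings searched by cosine similarity, and the names `retrieve_captions` and `build_prompt`, the hand-set toy embeddings, and the prompt wording are all hypothetical.

```python
import numpy as np

def retrieve_captions(query, datastore_embs, datastore_captions, k=2):
    """Return the k captions whose embeddings are most cosine-similar to the query.

    Hypothetical helper: assumes query and datastore embeddings share one space.
    """
    q = query / np.linalg.norm(query)
    d = datastore_embs / np.linalg.norm(datastore_embs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per datastore entry
    top = np.argsort(-scores)[:k]       # indices of the k highest scores
    return [datastore_captions[i] for i in top]

def build_prompt(retrieved):
    """Assumed prompt format: retrieved captions are given to the LLM as context."""
    return "Similar captions: " + "; ".join(retrieved) + ". Describe the audio:"

# Toy text-only datastore with hand-set embeddings (illustrative values only).
captions = ["a dog barks", "rain falls on a roof", "a car engine idles"]
embs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])

# Stand-in for the embedding of an input audio clip.
query = np.array([0.9, 0.1, 0.3])

retrieved = retrieve_captions(query, embs, captions, k=2)
prompt = build_prompt(retrieved)
print(prompt)
```

Under this reading, moving to a new domain would amount to swapping in a caption datastore from that domain, with no change to model weights; that interpretation is an assumption here, not something this listing states.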

Findings

01 Outperforms existing zero-shot models in in-domain scenarios.

02 Achieves state-of-the-art results in cross-domain audio captioning.

03 Demonstrates effective domain adaptation without additional fine-tuning.

Abstract

While automated audio captioning (AAC) has made notable progress, traditional fully supervised AAC models still face two critical challenges: the need for expensive audio-text pair data for training and performance degradation when transferring across domains. To overcome these limitations, we present DRCap, a data-efficient and flexible zero-shot audio captioning system that requires text-only data for training and can quickly adapt to new domains without additional fine-tuning. DRCap integrates a contrastive language-audio pre-training (CLAP) model and a large language model (LLM) as its backbone. During training, the model predicts the ground-truth caption with a fixed text encoder from CLAP, whereas, during inference, the text encoder is replaced with the audio encoder to generate captions for audio clips in a zero-shot manner. To mitigate the modality gap of the CLAP model, we use…
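The encoder swap described in the abstract (text encoder during training, audio encoder during inference) can be illustrated with a small sketch. Everything here is an assumption made for illustration: `toy_text_encoder` and `toy_audio_encoder` are hypothetical stand-ins for the CLAP encoders, `embed_for_decoder` is an invented name, and the 8-dimensional embedding is a toy size. The sketch does not cover how the modality gap is handled.

```python
import numpy as np

EMBED_DIM = 8  # toy size chosen for illustration only

def _normalize(v: np.ndarray) -> np.ndarray:
    # Scale to (approximately) unit length; epsilon avoids division by zero.
    return v / (np.linalg.norm(v) + 1e-12)

def toy_text_encoder(caption: str) -> np.ndarray:
    # Hypothetical stand-in for the frozen CLAP text encoder:
    # bucket character codes into a fixed-size vector.
    v = np.zeros(EMBED_DIM)
    for ch in caption.lower():
        v[ord(ch) % EMBED_DIM] += 1.0
    return _normalize(v)

# Fixed random projection used by the toy audio encoder (assumption).
_AUDIO_PROJ = np.random.default_rng(0).standard_normal((EMBED_DIM, 4))

def toy_audio_encoder(waveform: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for the CLAP audio encoder:
    # project four summary statistics of the waveform into the same space.
    stats = np.array([waveform.mean(), waveform.std(),
                      waveform.max(), waveform.min()])
    return _normalize(_AUDIO_PROJ @ stats)

def embed_for_decoder(source, mode: str) -> np.ndarray:
    # Training sees captions only; inference sees audio only.
    # The downstream decoder receives a vector of the same shape either way.
    if mode == "train":
        return toy_text_encoder(source)
    if mode == "infer":
        return toy_audio_encoder(source)
    raise ValueError(f"unknown mode: {mode!r}")

train_vec = embed_for_decoder("a dog barks in the distance", "train")
t = np.linspace(0.0, 1.0, 16000)
infer_vec = embed_for_decoder(np.sin(2 * np.pi * 440.0 * t), "infer")
print(train_vec.shape, infer_vec.shape)
```

The point of the sketch is only that both paths feed the decoder an embedding of identical shape, which is what makes the swap possible when the two encoders share an embedding space, as CLAP's contrastive training is designed to provide.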

Code & Models

Repositories

X-LANCE/SLAM-LLM (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Music and Audio Processing · Subtitles and Audiovisual Media