Training Audio Captioning Models without Audio
Soham Deshmukh, Benjamin Elizalde, Dimitra Emmanouilidou, Bhiksha Raj, Rita Singh, Huaming Wang

TL;DR
This paper introduces a method for training audio captioning models on text data alone. It leverages contrastively trained audio-text models such as CLAP, so the captioner itself can be trained without paired audio-caption data.
Contribution
The authors propose a text-only training framework for audio captioning that utilizes pretrained audio-text models and introduces techniques to bridge modality gaps, reducing reliance on costly audio-caption pairs.
Findings
The method achieves competitive performance with state-of-the-art models trained on paired data.
It enables stylized audio captioning and caption enrichment without audio or human captions.
The approach demonstrates effective transfer from text to audio modalities.
Abstract
Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an approach to train AAC systems using only text. Our approach leverages the multimodal space of contrastively trained audio-text models, such as CLAP. During training, a decoder generates captions conditioned on the pretrained CLAP text encoder. During inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap between text and audio embeddings, we propose the use of noise injection or a learnable adapter, during training. We find that the proposed…
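The train-on-text, infer-on-audio swap with noise injection described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`inject_noise`, `decoder_prefix`), the Gaussian form of the noise, the standard deviation of 0.1, the renormalization to unit length, and the toy 8-dimensional vectors standing in for CLAP embeddings are all hypothetical choices made here for illustration.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def inject_noise(embedding, std, rng):
    """Add zero-mean Gaussian noise to each dimension, then renormalize.

    Assumption: Gaussian noise and renormalization are illustrative choices;
    the abstract only says that noise is injected during training.
    """
    noisy = [x + rng.gauss(0.0, std) for x in embedding]
    return l2_normalize(noisy)


def decoder_prefix(embedding, training, std=0.1, rng=None):
    """Return the vector a caption decoder would be conditioned on.

    Training: the input is a text embedding, perturbed with noise.
    Inference: the input is an audio embedding, used without noise.
    """
    if training:
        rng = rng or random.Random()
        return inject_noise(embedding, std, rng)
    return l2_normalize(embedding)


# Toy stand-ins for encoder outputs (hypothetical; real CLAP embeddings are larger).
rng = random.Random(0)
text_emb = l2_normalize([rng.gauss(0.0, 1.0) for _ in range(8)])
audio_emb = l2_normalize([rng.gauss(0.0, 1.0) for _ in range(8)])

train_prefix = decoder_prefix(text_emb, training=True, std=0.1, rng=rng)
infer_prefix = decoder_prefix(audio_emb, training=False)
```

The intuition behind this design is that a decoder trained on slightly perturbed text embeddings may tolerate the offset between text and audio embeddings at inference time; the learnable adapter mentioned in the abstract is an alternative and is not shown here.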
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
