Weakly-supervised Automated Audio Captioning via text only training

Theodoros Kouzelis; Vassilis Katsouros

arXiv:2309.12242·cs.SD·September 22, 2023

TL;DR

This paper introduces a weakly-supervised method for automated audio captioning that relies solely on text data and a pre-trained CLAP model, reducing the need for labor-intensive paired datasets.

Contribution

The paper presents an approach that uses Contrastive Language-Audio Pretraining (CLAP) to train an audio captioning model without paired audio-text data, together with strategies that bridge the modality gap between audio and text embeddings during training and inference.

Findings

01. Achieves up to 83% of the performance of fully supervised models

02. Utilizes CLAP embeddings for training without paired data

03. Demonstrates effectiveness on the Clotho and AudioCaps datasets

Abstract

In recent years, datasets of paired audio and captions have enabled remarkable success in automatically generating descriptions for audio clips, namely Automated Audio Captioning (AAC). However, it is labor-intensive and time-consuming to collect a sufficient number of paired audio and captions. Motivated by the recent advances in Contrastive Language-Audio Pretraining (CLAP), we propose a weakly-supervised approach to train an AAC model assuming only text data and a pre-trained CLAP model, alleviating the need for paired target data. Our approach leverages the similarity between audio and text embeddings in CLAP. During training, we learn to reconstruct the text from the CLAP text embedding, and during inference, we decode using the audio embeddings. To mitigate the modality gap between the audio and text embeddings we employ strategies to bridge the gap during training and inference…
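
The train/inference asymmetry described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are toy 4-dimensional vectors standing in for real CLAP outputs, the function names are hypothetical, and Gaussian noise injection is used only as one plausible example of a training-time gap-bridging strategy (the truncated abstract does not say which strategies the paper uses).

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def training_input(text_emb, noise_std, rng):
    """Decoder input at training time: a perturbed text embedding.

    ASSUMPTION: Gaussian noise injection is an illustrative choice here;
    the abstract does not specify the gap-bridging strategies.
    """
    noisy = [x + rng.gauss(0.0, noise_std) for x in text_emb]
    return l2_normalize(noisy)


def inference_input(audio_emb):
    """Decoder input at inference time: the normalized audio embedding."""
    return l2_normalize(audio_emb)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


# Toy stand-ins for CLAP embeddings of a caption and of a matching audio clip.
rng = random.Random(0)
text_emb = l2_normalize([0.2, -0.5, 0.1, 0.8])
audio_emb = [0.25, -0.4, 0.05, 0.9]

train_vec = training_input(text_emb, noise_std=0.1, rng=rng)  # paired with its caption as target
infer_vec = inference_input(audio_emb)                        # fed to the same decoder at test time

print("similarity of training and inference inputs:", cosine(train_vec, infer_vec))
```

In a real setup the two vectors would come from the text and audio encoders of a pre-trained CLAP model, and a text decoder trained to reconstruct captions from `train_vec`-style inputs would be run on `infer_vec` at inference.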

Code & Models

Repositories

zelaki/wsac (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis