CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding

Étienne Labbé; Thomas Pellegrini; Julien Pinquier

arXiv:2309.00454 · cs.SD · September 4, 2023

TL;DR

CoNeTTE is an audio captioning system built on a ConvNeXt audio encoder and dataset-specific task embeddings. Its ConvNeXt-based model reaches state-of-the-art scores on AudioCaps and performs competitively on Clotho, with four to forty times fewer parameters than existing models.

Contribution

The paper presents CoNeTTE, a new audio captioning model that leverages a ConvNeXt encoder and dataset-specific task embeddings for improved cross-dataset performance.
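This page does not say how the task embedding enters the model. As a minimal sketch of one plausible reading, the snippet below assumes that each training caption is prefixed with a token naming its source dataset, so that a decoder could learn one embedding per dataset. The token strings, the helper names `build_decoder_input` and `build_vocab`, and the whitespace tokenizer are all hypothetical and are not taken from the paper or its repository.

```python
# Hypothetical per-dataset task tokens (an assumption, not the paper's format).
TASK_TOKENS = {"audiocaps": "<task:audiocaps>", "clotho": "<task:clotho>"}
EOS = "<eos>"


def build_decoder_input(caption, dataset):
    """Prefix a whitespace-tokenized caption with its dataset's task token."""
    if dataset not in TASK_TOKENS:
        raise KeyError(f"unknown dataset: {dataset!r}")
    return [TASK_TOKENS[dataset], *caption.lower().split(), EOS]


def build_vocab(sequences):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


# Two toy captions, one per dataset; the text is made up for illustration.
train = [
    build_decoder_input("A dog barks", "audiocaps"),
    build_decoder_input("A dog is barking loudly in the distance", "clotho"),
]
vocab = build_vocab(train)

# Each sequence starts with a different id, so an embedding table indexed by
# these ids would hold one learnable vector per dataset.
ids = [[vocab[tok] for tok in seq] for seq in train]
print(ids[0][0], ids[1][0])
```

Under this assumption, choosing the task token at inference time would steer the decoder toward the caption style of one dataset without changing the audio encoder.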

Findings

01. State-of-the-art scores on the AudioCaps dataset.
02. Four to forty times fewer parameters than existing models.
03. Task embeddings improve cross-dataset generalization.

Abstract

Automated Audio Captioning (AAC) involves generating natural language descriptions of audio content, using encoder-decoder architectures. An audio encoder produces audio embeddings that are fed to a decoder, usually a Transformer decoder, for caption generation. In this work, we describe our model, whose novelty, compared to existing models, lies in the use of a ConvNeXt architecture as audio encoder, adapted from the vision domain to audio classification. This model, called CNext-trans, achieved state-of-the-art scores on the AudioCaps (AC) dataset and performed competitively on Clotho (CL), while using four to forty times fewer parameters than existing models. We examine potential biases in the AC dataset due to its origin from AudioSet by investigating an unbiased encoder's impact on performance. Using the well-known PANN's CNN14, for instance, as an unbiased encoder, we observed a 1.7% absolute…

Code & Models

Repositories

labbeti/conette-audio-captioning
PyTorch · Official

Models

🤗 Labbeti/conette
model · 114 dl · ♡ 1

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Video Analysis and Summarization