CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding
Étienne Labbé, Thomas Pellegrini, Julien Pinquier

TL;DR
The paper introduces CoNeTTE, an audio captioning system built on a ConvNeXt encoder and task embeddings, and reports state-of-the-art scores on AudioCaps and competitive scores on Clotho with four to forty times fewer parameters than existing models.
Contribution
The paper presents CoNeTTE, a new audio captioning model that leverages a ConvNeXt encoder and dataset-specific task embeddings for improved cross-dataset performance.
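As a rough illustration of what a dataset-specific task embedding can look like, the sketch below swaps the decoder's start token for one that names the source dataset, so the same caption is encoded differently for AudioCaps and Clotho. This is a minimal sketch under stated assumptions, not the authors' implementation: the token names, the `build_vocab` and `encode_with_task` helpers, and the choice of a start-token mechanism are all hypothetical.

```python
# Hypothetical sketch: condition a caption decoder on the source dataset
# by using a dataset-specific start token. Names are illustrative only.

TASK_TOKENS = {
    "audiocaps": "<bos_audiocaps>",  # assumed token name
    "clotho": "<bos_clotho>",        # assumed token name
}
EOS = "<eos>"


def build_vocab(captions):
    """Map every task token, the end token, and each caption word to an id."""
    tokens = list(TASK_TOKENS.values()) + [EOS]
    for caption in captions:
        for word in caption.lower().split():
            if word not in tokens:
                tokens.append(word)
    return {tok: idx for idx, tok in enumerate(tokens)}


def encode_with_task(caption, dataset, vocab):
    """Return token ids for a caption, prefixed by its dataset's task token."""
    words = [TASK_TOKENS[dataset]] + caption.lower().split() + [EOS]
    return [vocab[w] for w in words]


captions = ["a dog barks", "rain falls on a roof"]
vocab = build_vocab(captions)

# The same caption, tagged for two different datasets.
ac_ids = encode_with_task("a dog barks", "audiocaps", vocab)
cl_ids = encode_with_task("a dog barks", "clotho", vocab)
```

In a full model, each id would index a learned embedding table, so the decoder could learn a distinct starting vector per dataset and be asked for a given captioning style at inference time by choosing the task token.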
Findings
State-of-the-art scores on the AudioCaps dataset and competitive scores on Clotho.
Four to forty times fewer parameters than existing models.
Task embeddings improve cross-dataset generalization.
Abstract
Automated Audio Captioning (AAC) involves generating natural language descriptions of audio content, using encoder-decoder architectures. An audio encoder produces audio embeddings fed to a decoder, usually a Transformer decoder, for caption generation. In this work, we describe our model, whose novelty, compared to existing models, lies in the use of a ConvNeXt architecture as the audio encoder, adapted from the vision domain to audio classification. This model, called CNext-trans, achieved state-of-the-art scores on the AudioCaps (AC) dataset and performed competitively on Clotho (CL), while using four to forty times fewer parameters than existing models. We examine potential biases in the AC dataset due to its origin from AudioSet by investigating the impact of an unbiased encoder on performance. Using the well-known PANN's CNN14, for instance, as an unbiased encoder, we observed a 1.7% absolute…
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Video Analysis and Summarization
