Automatic Audio Captioning using Attention weighted Event based Embeddings
Swapnil Bhosale, Rupayan Chakraborty, Sunil Kumar Kopparapu

TL;DR
This paper introduces a lightweight encoder-decoder model for automatic audio captioning (AAC) that combines attention-weighted event embeddings with transfer learning from pre-trained audio event detection (AED) models, and reports better results than computationally heavier prior architectures while using fewer learnable parameters.
Contribution
It proposes an efficient AAC architecture that uses pre-trained AED models as embedding extractors together with temporal attention, and that outperforms more complex existing methods.
Findings
Attention-weighted embeddings improve captioning accuracy.
Transfer learning from AED models enhances performance.
The model requires fewer parameters than traditional architectures.
Abstract
Automatic Audio Captioning (AAC) refers to the task of translating audio into natural language that describes the audio events, the sources of the events, and their relationships. The limited number of samples in current AAC datasets has set a trend of incorporating transfer learning with Audio Event Detection (AED) as a parent task. In this direction, we propose an encoder-decoder architecture with lightweight (i.e., with fewer learnable parameters) Bi-LSTM recurrent layers for AAC and compare the performance of two state-of-the-art pre-trained AED models as embedding extractors. Our results show that an efficient AED-based embedding extractor combined with temporal attention and augmentation techniques is able to surpass existing literature with computationally intensive architectures. Further, we provide evidence of the ability of the non-uniform attention weighted…
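To make the idea of attention-weighted embeddings concrete, here is a minimal sketch of temporal attention pooling over a sequence of frame-level embeddings. It is an illustration only, not the authors' implementation: the dot-product scoring, the `attention_pool` and `softmax` helpers, and the toy `frames` and `query` values are all assumptions, since the abstract does not specify how attention scores are computed.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    frames: Sequence[Sequence[float]], query: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Pool frame embeddings into one context vector.

    Assumption: each frame is scored by its dot product with `query`
    (a hypothetical stand-in for a decoder state). The scores are
    softmax-normalised into non-uniform weights over time, and the
    context vector is the weighted sum of the frame embeddings.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    context = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return weights, context


# Toy example: three time frames of 2-dimensional embeddings.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]
weights, context = attention_pool(frames, query)
```

In this toy input the third frame aligns best with the query, so it receives the largest weight and contributes most to the pooled context vector. In a captioning decoder, such weights would typically be recomputed at every output step.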
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
