Audio Captioning with Composition of Acoustic and Semantic Information
Ayşegül Özkaya Eren, Mustafa Sert

TL;DR
This paper introduces a novel encoder-decoder model for audio captioning that integrates semantic and acoustic information, demonstrating improved performance over existing methods on two datasets.
Contribution
The study presents a new BiGRU-based architecture combining audio and semantic embeddings, with a semantic prediction module for test audio, advancing audio captioning techniques.
Findings
Outperforms state-of-the-art models on Clotho and AudioCaps datasets
Semantic information enhances captioning accuracy
Effective use of multiple audio features
Abstract
Generating audio captions is a new research area that combines audio and natural language processing to create meaningful textual descriptions for audio clips. To address this problem, previous studies mostly use encoder-decoder based models without considering semantic information. To fill this gap, we present a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings. We extract semantic embeddings by obtaining subjects and verbs from the audio clip captions and combine these embeddings with audio embeddings to feed the BiGRU-based encoder-decoder model. To enable semantic embeddings for the test audio clips, we introduce a Multilayer Perceptron classifier to predict the semantic embeddings of those clips. We also present exhaustive experiments to show the efficiency of different features and datasets for our proposed model…
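The abstract describes building a semantic embedding from caption subjects and verbs and combining it with the audio embedding. The following is a minimal sketch of one way that step could look, not the authors' implementation. It assumes the subject and verb words have already been extracted from the captions, represents them as a multi-hot vector, and combines by appending that vector to every audio frame. The function names, the multi-hot encoding, the concatenation, and the toy data are all illustrative assumptions; the abstract does not specify them. At test time the paper predicts the semantic embedding with a Multilayer Perceptron, which is not shown here.

```python
from typing import Dict, List, Sequence


def build_vocab(word_lists: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map each distinct subject/verb word to a fixed index (sorted for determinism)."""
    words = sorted({w for words_of_clip in word_lists for w in words_of_clip})
    return {w: i for i, w in enumerate(words)}


def semantic_vector(words: Sequence[str], vocab: Dict[str, int]) -> List[float]:
    """Multi-hot vector: 1.0 at the index of each known subject/verb word."""
    vec = [0.0] * len(vocab)
    for w in words:
        if w in vocab:  # words outside the vocabulary are ignored
            vec[vocab[w]] = 1.0
    return vec


def fuse(audio_frames: Sequence[Sequence[float]],
         semantic: Sequence[float]) -> List[List[float]]:
    """Append the clip-level semantic vector to every audio frame (assumed fusion)."""
    return [list(frame) + list(semantic) for frame in audio_frames]


# Hypothetical subject/verb words, assumed to be pre-extracted from training captions.
train_words = [["bird", "sing"], ["car", "pass"], ["bird", "fly"]]
vocab = build_vocab(train_words)

# Toy audio embedding: 2 frames with 2 features each (made-up values).
audio = [[0.1, 0.2], [0.3, 0.4]]
sem = semantic_vector(["bird", "sing"], vocab)
fused = fuse(audio, sem)  # this sequence would be the input to the BiGRU encoder
```

In this sketch each fused frame has the audio feature dimension plus the vocabulary size, so the encoder input width grows with the number of distinct subject and verb words.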
Taxonomy
Methods: Bidirectional GRU
