Audio Captioning with Composition of Acoustic and Semantic Information

Ayşegül Özkaya Eren; Mustafa Sert

arXiv:2105.06355·cs.SD·May 14, 2021

TL;DR

This paper introduces a novel encoder-decoder model for audio captioning that integrates semantic and acoustic information, demonstrating improved performance over existing methods on two datasets.

Contribution

The study presents a new BiGRU-based encoder-decoder architecture that combines audio and semantic embeddings, together with a Multilayer Perceptron classifier that predicts semantic embeddings for test audio clips.

Findings

01 Outperforms state-of-the-art models on Clotho and AudioCaps datasets

02 Semantic information enhances captioning accuracy

03 Effective use of multiple audio features

Abstract

Generating audio captions is a new research area that combines audio and natural language processing to create meaningful textual descriptions for audio clips. To address this problem, previous studies mostly use encoder-decoder based models without considering semantic information. To fill this gap, we present a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings. We extract semantic embeddings by obtaining subjects and verbs from the audio clip captions and combine these embeddings with audio embeddings to feed the BiGRU-based encoder-decoder model. To enable semantic embeddings for the test audio clips, we introduce a Multilayer Perceptron classifier to predict the semantic embeddings of those clips. We also present exhaustive experiments to show the efficiency of different features and datasets for our proposed model…
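The semantic-embedding step in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes that the subjects and verbs have already been extracted from each caption (for example, with a part-of-speech tagger), that the semantic embedding is a multi-hot vector over those words, and that it is combined with the audio embedding by concatenating it onto every audio frame. The function names `build_vocab`, `semantic_vector`, and `combine_frames` are hypothetical. The Multilayer Perceptron that predicts the semantic vector for test clips is not shown.

```python
from typing import Dict, List, Sequence


def build_vocab(tagged_captions: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Assign an index to every distinct subject/verb word (sorted for stability)."""
    words = sorted({w for caption in tagged_captions for w in caption})
    return {w: i for i, w in enumerate(words)}


def semantic_vector(words: Sequence[str], vocab: Dict[str, int]) -> List[float]:
    """Multi-hot vector (an assumed encoding): 1.0 for each known subject/verb."""
    vec = [0.0] * len(vocab)
    for w in words:
        if w in vocab:  # words outside the vocabulary are ignored
            vec[vocab[w]] = 1.0
    return vec


def combine_frames(audio_frames: Sequence[Sequence[float]],
                   sem_vec: Sequence[float]) -> List[List[float]]:
    """Concatenate the clip-level semantic vector onto each audio frame
    (concatenation is an assumption about how the embeddings are combined)."""
    return [list(frame) + list(sem_vec) for frame in audio_frames]


# Toy data: subjects and verbs already extracted from two training captions.
tagged = [["dog", "barks"], ["rain", "falls"]]
vocab = build_vocab(tagged)
sem = semantic_vector(["dog", "barks"], vocab)

# Two made-up audio frames with two features each.
encoder_input = combine_frames([[0.1, 0.2], [0.3, 0.4]], sem)
print(encoder_input)
```

Each row of `encoder_input` is one time step that a recurrent encoder such as a BiGRU could consume. At test time no caption is available, so under this sketch the semantic vector would come from a classifier's prediction rather than from `semantic_vector`.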

Taxonomy

Methods: Bidirectional GRU