Audio Captioning Transformer

Xinhao Mei; Xubo Liu; Qiushi Huang; Mark D. Plumbley; Wenwu Wang

arXiv:2107.09817 · eess.AS · July 22, 2021 · 32 cites

TL;DR

This paper introduces an Audio Captioning Transformer (ACT), a convolution-free encoder-decoder model that effectively captures global and temporal information in audio signals, achieving competitive results on the AudioCaps dataset.

Contribution

The paper presents a novel Transformer-based architecture for audio captioning that eliminates convolutional components, improving modeling of long-range dependencies and temporal relationships.

Findings

01 Achieves competitive performance on the AudioCaps dataset.
02 Models global and temporal audio information effectively.
03 Outperforms some existing approaches in audio captioning.

Abstract

Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling temporal relationships among the time frames in an audio signal, while RNNs can be limited in modelling the long-range dependencies among the time frames. In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free. The proposed method has a better ability to model the global information within an audio signal as well as capture temporal relationships between audio events. We…
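
The global modelling described in the abstract rests on self-attention, in which every time frame is compared against every other frame in a single step, regardless of how far apart they are. The following is a minimal sketch in plain Python, assuming single-head attention with identity query/key/value projections and random stand-in features. The names `self_attention`, `frames`, `T`, and `d` are illustrative and are not taken from the ACT repository; the actual model uses learned projections and multiple heads.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head scaled dot-product self-attention (identity projections).

    frames: list of T feature vectors, each a list of d floats.
    Returns (outputs, weights); weights[i][j] is how strongly frame i
    attends to frame j.
    """
    d = len(frames[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for query in frames:
        # Compare this frame against every frame in the clip, near or far.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted average of all frame vectors.
        outputs.append([sum(w * v[c] for w, v in zip(row, frames)) for c in range(d)])
    return outputs, weights


random.seed(0)
T, d = 6, 4  # hypothetical sizes: 6 time frames, 4 features per frame
frames = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
outputs, weights = self_attention(frames)
```

Each row of `weights` is a probability distribution over all `T` frames, so the first frame can draw on the last one directly. A recurrent encoder would instead have to carry that information through every intermediate step.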

Code & Models

Repositories

XinhaoMei/ACT (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Softmax · Dense Connections · Layer Normalization