TL;DR
This paper introduces LocalAFT, an attention-free Transformer decoder that adds local information from the audio signal so that captions cover both short- and long-duration events. It reports better results than existing methods on audio captioning.
Contribution
The paper proposes a local information assisted attention-free Transformer decoder for audio captioning, addressing the limitation of existing attention-based decoders in capturing local, short-duration events.
Findings
Outperforms state-of-the-art methods in DCASE 2021 Challenge Task 6
Effectively captures local and global audio information
Improves caption accuracy for short-duration audio events
Abstract
Automated audio captioning aims to describe audio data with captions using natural language. Existing methods often employ an encoder-decoder structure, where the attention-based decoder (e.g., Transformer decoder) is widely used and achieves state-of-the-art performance. Although this method effectively captures global information within audio data via the self-attention mechanism, it may miss events of short duration because it is limited in capturing local information in an audio signal, which leads to inaccurate captions. To address this issue, we propose a method that uses pretrained audio neural networks (PANNs) as the encoder and a local information assisted attention-free Transformer (LocalAFT) as the decoder. The novelty of our method is in the proposal of the LocalAFT decoder, which allows local information within an audio signal to be captured while…
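The abstract is cut off before it describes the LocalAFT layer, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes the locally windowed attention-free formulation commonly called AFT-local: a learned pairwise position bias is applied only inside a window around each time step and set to zero elsewhere, so nearby frames are emphasized while every earlier frame still contributes. The function name `local_aft`, the `window` parameter, and the causal mask are assumptions made for this example.

```python
import numpy as np

def local_aft(q, k, v, w, window, causal=True):
    """Sketch of a local attention-free layer (assumed AFT-local form).

    q, k, v: arrays of shape (T, d). w: (T, T) learned position biases.
    window: positions with |t - t'| < window keep their bias (hypothetical name).
    """
    T = q.shape[0]
    idx = np.arange(T)
    dist = np.abs(idx[:, None] - idx[None, :])
    # Keep the learned bias only inside the local window; use zero elsewhere.
    w_local = np.where(dist < window, w, 0.0)
    # logits[t, t', :] = k[t'] + w_local[t, t']
    logits = k[None, :, :] + w_local[:, :, None]
    if causal:
        # A decoder must not look at future positions (t' > t).
        future = idx[None, :] > idx[:, None]
        logits = np.where(future[:, :, None], -np.inf, logits)
    # Subtract the per-feature maximum for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Element-wise weighted average of the values, one weight per feature.
    context = (weights * v[None, :, :]).sum(axis=1)
    # The query acts as a sigmoid gate instead of forming dot products.
    gate = 1.0 / (1.0 + np.exp(-q))
    return gate * context
```

Unlike self-attention, no query-key dot product is computed: each output feature is a gated, position-biased average of the corresponding value feature. This sketch materializes a (T, T, d) tensor for readability, which a practical implementation would avoid.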
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Dropout · Layer Normalization · Dense Connections · Multi-Head Attention · Softmax · Byte Pair Encoding · Label Smoothing
