EAViT: External Attention Vision Transformer for Audio Classification

Aquib Iqbal; Abid Hasan Zim; Md Asaduzzaman Tonmoy; Limengnan Zhou; Asad Malik; Minoru Kuribayashi

arXiv:2408.13201·cs.SD·August 26, 2024

TL;DR

This paper introduces EAViT, an external attention Vision Transformer for audio classification. The model uses learnable memory units to capture long-range dependencies and reports 93.99% accuracy on the GTZAN dataset, which the authors state outperforms existing methods.

Contribution

The paper proposes the EAViT model, which integrates multi-head external attention into the Vision Transformer (ViT) architecture to improve audio classification performance.
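
To make the idea of multi-head external attention concrete, here is a minimal NumPy sketch. It follows the general external attention formulation (token features compared against two small shared memory matrices, with a softmax over tokens followed by an L1 normalization over memory slots) and is not necessarily the authors' exact configuration. The function names, the memory size, and the choice to share the memory units across heads are assumptions made for illustration; the linear projections that usually surround the attention and all training logic are omitted, and the memory units are random here rather than learned.

```python
import numpy as np


def external_attention(x, m_k, m_v, eps=1e-9):
    """Single-head external attention (illustrative sketch).

    x:   (n_tokens, d) token features
    m_k: (s, d) key memory unit, m_v: (s, d) value memory unit
    """
    attn = x @ m_k.T                                  # (n_tokens, s) similarity to memory slots
    attn = attn - attn.max(axis=0, keepdims=True)     # stabilize the exponent
    attn = np.exp(attn)
    attn = attn / attn.sum(axis=0, keepdims=True)     # softmax over the token axis
    attn = attn / (attn.sum(axis=1, keepdims=True) + eps)  # L1-normalize over memory slots
    return attn @ m_v                                 # (n_tokens, d)


def multi_head_external_attention(x, m_k, m_v, num_heads):
    """Split features into heads, apply external attention per head, then merge.

    m_k and m_v have shape (s, d // num_heads) and are shared by all heads
    (an assumption of this sketch).
    """
    n, d = x.shape
    assert d % num_heads == 0, "feature size must be divisible by num_heads"
    head_dim = d // num_heads
    heads = x.reshape(n, num_heads, head_dim).transpose(1, 0, 2)       # (h, n, head_dim)
    out = np.stack([external_attention(h, m_k, m_v) for h in heads])   # (h, n, head_dim)
    return out.transpose(1, 0, 2).reshape(n, d)                        # (n, d)
```

Because the memory matrices have a fixed number of slots `s`, the cost of this operation grows linearly with the number of tokens, unlike self-attention, whose token-to-token similarity matrix grows quadratically.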

Findings

1. EAViT achieves 93.99% accuracy on the GTZAN dataset.

2. External attention improves long-range dependency modeling.

3. EAViT outperforms state-of-the-art audio classification models.

Abstract

This paper presents the External Attention Vision Transformer (EAViT) model, a novel approach designed to enhance audio classification accuracy. As digital audio resources proliferate, the demand for precise and efficient audio classification systems has intensified, driven by the need for improved recommendation systems and user personalization in various applications, including music streaming platforms and environmental sound recognition. Accurate audio classification is crucial for organizing vast audio libraries into coherent categories, enabling users to find and interact with their preferred audio content more effectively. In this study, we utilize the GTZAN dataset, which comprises 1,000 music excerpts spanning ten diverse genres. Each 30-second audio clip is segmented into 3-second excerpts to enhance dataset robustness and mitigate overfitting risks, allowing for more granular…
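
The segmentation step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the segments are taken to be non-overlapping, the 22,050 Hz sample rate is an assumed value for the example, and `segment_bounds` is a hypothetical helper name. The abstract does not specify whether the authors use overlap or how they handle clips that are not exactly 30 seconds long; this sketch simply drops any trailing partial segment.

```python
def segment_bounds(num_samples, sample_rate, segment_seconds=3.0):
    """Return (start, end) sample indices of non-overlapping fixed-length segments.

    Any trailing audio shorter than one full segment is dropped.
    """
    seg_len = int(round(segment_seconds * sample_rate))
    if seg_len <= 0:
        raise ValueError("segment length must be positive")
    last_start = num_samples - seg_len
    return [(start, start + seg_len) for start in range(0, last_start + 1, seg_len)]


# Example: an exactly 30-second clip at an assumed 22,050 Hz sample rate.
sample_rate = 22050
bounds = segment_bounds(30 * sample_rate, sample_rate)
print(len(bounds))  # 10 segments of 3 seconds each
```

Under these assumptions, each 30-second clip yields ten excerpts, so 1,000 clips would give 10,000 training examples. Excerpts cut from the same clip are correlated, so a train/test split done at the clip level avoids leaking near-duplicates across the split.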

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Neural Networks and Applications

Methods: Linear Layer · Adam · Layer Normalization · Attention Is All You Need · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Absolute Position Encodings