Asca: less audio data is more insightful

Xiang Li; Junhao Chen; Chao Li; Hongwu Lv

arXiv:2309.13373·cs.SD·September 26, 2023

TL;DR

ASCA introduces a hybrid Transformer-convolution model with novel attention and data strategies, significantly improving audio recognition accuracy in resource-limited specialized domains like birdsong and submarine acoustics.

Contribution

The paper presents a new hybrid architecture, ASCA, combining Transformer and convolutional components, tailored for effective audio recognition with limited data.
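To make the idea of a convolution-then-attention hybrid concrete, here is a toy sketch in NumPy. It is not ASCA's architecture and not taken from the paper or its repository: the function names, shapes, kernel sizes, random weights, and the three-class output are all hypothetical, chosen only to show a convolution stage downsampling a spectrogram into tokens that a self-attention stage then mixes.

```python
import numpy as np

def conv2d_valid(x, kernel, stride=1):
    """Plain 2D cross-correlation (no padding) over a (freq, time) array."""
    kh, kw = kernel.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (n_tokens, dim)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return weights @ v, weights

def hybrid_forward(spec, kernels, w_q, w_k, w_v, w_out):
    """Hypothetical hybrid: conv stage -> token sequence -> attention -> classifier."""
    # Convolution stage: one feature map per kernel, stride 2 downsampling, ReLU.
    feats = np.stack(
        [np.maximum(conv2d_valid(spec, k, stride=2), 0.0) for k in kernels],
        axis=-1,
    )
    # Flatten the (freq, time) grid into a token sequence of channel vectors.
    tokens = feats.reshape(-1, feats.shape[-1])
    # Attention stage with a residual connection, then mean pooling over tokens.
    attended, weights = self_attention(tokens, w_q, w_k, w_v)
    pooled = (tokens + attended).mean(axis=0)
    return softmax(pooled @ w_out), weights

# Toy inputs: a random 16x20 "spectrogram" and small random weights (assumptions).
rng = np.random.default_rng(0)
spec = rng.standard_normal((16, 20))
kernels = rng.standard_normal((4, 3, 3)) * 0.1
w_q, w_k, w_v = (rng.standard_normal((4, 4)) * 0.1 for _ in range(3))
w_out = rng.standard_normal((4, 3)) * 0.1

probs, attn = hybrid_forward(spec, kernels, w_q, w_k, w_v, w_out)
print(probs.shape, attn.shape)
```

The convolution stage reduces the 16x20 input to a 7x9 grid, so attention runs over 63 tokens instead of 320 input bins. Running attention on fewer tokens with local features already extracted is the usual motivation for placing convolutions before attention in hybrids of this kind.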

Findings

01. Achieved 81.2% accuracy on BirdCLEF2023
02. Reached 35.1% accuracy on AudioSet (Balanced)
03. Significantly outperformed competing methods on both benchmarks

Abstract

Audio recognition in specialized areas such as birdsong and submarine acoustics faces challenges in large-scale pre-training because the sampling environments and specificity requirements limit the number of available samples. While the Transformer model excels in audio recognition, its dependence on vast amounts of data becomes restrictive in resource-limited settings. Addressing this, we introduce the Audio Spectrogram Convolution Attention (ASCA) based on CoAtNet, integrating a Transformer-convolution hybrid architecture, novel network design, and attention techniques, further augmented with data enhancement and regularization strategies. On BirdCLEF2023 and AudioSet (Balanced), ASCA achieved accuracies of 81.2% and 35.1%, respectively, significantly outperforming competing methods. The unique structure of our model enriches output, enabling generalization across various…

Code & Models

Repositories

leeciang/asca (Official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Label Smoothing · Dropout · Convolution · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Linear Layer