ASCA: less audio data is more insightful
Xiang Li, Junhao Chen, Chao Li, Hongwu Lv

TL;DR
ASCA introduces a hybrid Transformer-convolution model with novel attention and data strategies, significantly improving audio recognition accuracy in resource-limited specialized domains like birdsong and submarine acoustics.
Contribution
The paper presents a new hybrid architecture, ASCA, combining Transformer and convolutional components, tailored for effective audio recognition with limited data.
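The general idea of pairing convolution with attention can be sketched as follows. This is only an illustrative toy, not the ASCA architecture: it assumes a spectrogram given as a list of time frames, applies a fixed smoothing filter along time (local context) and then single-head self-attention with identity projections (global context). The function names, the kernel values, and the absence of learned weights are all assumptions made for the example.

```python
import math

def conv1d_same(frames, kernel):
    """Slide a 1D filter along the time axis with zero padding,
    applied independently to every frequency bin (depthwise).
    Assumes an odd-length kernel; output length equals input length."""
    half = len(kernel) // 2
    n, d = len(frames), len(frames[0])
    out = []
    for t in range(n):
        row = [0.0] * d
        for k, w in enumerate(kernel):
            src = t + k - half
            if 0 <= src < n:  # positions outside the clip count as zeros
                for j in range(d):
                    row[j] += w * frames[src][j]
        out.append(row)
    return out

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(frames):
    """Single-head scaled dot-product self-attention.
    Queries, keys, and values are the frames themselves
    (identity projections, an assumption to keep the toy small)."""
    d = len(frames[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in frames:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in frames]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, frames))
                    for j in range(d)])
    return out

def hybrid_block(frames, kernel=(0.25, 0.5, 0.25)):
    """Convolution first (local patterns), then attention (global mixing)."""
    return self_attention(conv1d_same(frames, kernel))

# Toy "spectrogram": 4 time frames, 2 frequency bins.
spec = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
mixed = hybrid_block(spec)
```

The output keeps the input shape (one vector per time frame), so blocks of this kind can be stacked. In a trained model the filter and the attention projections would be learned rather than fixed.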
Findings
Achieved 81.2% accuracy on BirdCLEF2023
Reached 35.1% accuracy on AudioSet(Balanced)
Significantly outperformed competing methods on both benchmarks
Abstract
Audio recognition in specialized areas such as birdsong and submarine acoustics faces challenges in large-scale pre-training due to the limitations in available samples imposed by sampling environments and specificity requirements. While the Transformer model excels in audio recognition, its dependence on vast amounts of data becomes restrictive in resource-limited settings. Addressing this, we introduce the Audio Spectrogram Convolution Attention (ASCA) based on CoAtNet, integrating a Transformer-convolution hybrid architecture, novel network design, and attention techniques, further augmented with data enhancement and regularization strategies. On BirdCLEF2023 and AudioSet(Balanced), ASCA achieved accuracies of 81.2% and 35.1%, respectively, significantly outperforming competing methods. The unique structure of our model enriches output, enabling generalization across various…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Label Smoothing · Dropout · Convolution · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Linear Layer
