Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models
Florian Schmid, Khaled Koutini, Gerhard Widmer

TL;DR
This paper introduces dynamic CNN blocks built from dynamic non-linearities, dynamic convolutions, and attention mechanisms. These blocks increase the capacity of efficient CNNs for audio tagging, so that they match or outperform Transformer models on large-scale datasets.
Contribution
The paper presents a dynamic CNN architecture whose adaptive components improve both performance and efficiency, outperforming traditional CNNs and matching Transformer performance.
Findings
Dynamic CNNs outperform traditional CNNs in audio tagging.
Dynamic CNNs perform comparably to or better than Transformers.
The proposed models scale well to large datasets and downstream tasks.
Abstract
The introduction of large-scale audio datasets, such as AudioSet, paved the way for Transformers to conquer the audio domain and replace CNNs as the state-of-the-art neural network architecture for many tasks. Audio Spectrogram Transformers are excellent at exploiting large datasets, creating powerful pre-trained models that surpass CNNs when fine-tuned on downstream tasks. However, current popular Audio Spectrogram Transformers are demanding in terms of computational complexity compared to CNNs. Recently, we have shown that, by employing Transformer-to-CNN Knowledge Distillation, efficient CNNs can catch up with and even outperform Transformers on large datasets. In this work, we extend this line of research and increase the capacity of efficient CNNs by introducing dynamic CNN blocks, constructed of dynamic non-linearities, dynamic convolutions and attention mechanisms. We show that…
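The abstract names dynamic convolutions as one building block but does not spell out the mechanism. As a rough illustration only, the sketch below follows one common formulation of dynamic convolution: several candidate kernels are mixed with input-dependent softmax weights before a single convolution is applied. It is not the authors' implementation. The 1D signal, the scalar gate, and every name in it (`dynamic_conv1d`, `gate_weights`, `temperature`) are assumptions made for the example.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_conv1d(signal, kernels, gate_weights, temperature=1.0):
    """Illustrative dynamic convolution on a 1D signal.

    signal:       list of floats (the input).
    kernels:      list of K candidate kernels, all of the same width.
    gate_weights: list of K (weight, bias) pairs; a hypothetical linear
                  gate that turns the pooled context into one score per kernel.
    Returns (output, attention), where attention sums to 1.
    """
    # 1. Summarize the input with global average pooling.
    context = sum(signal) / len(signal)

    # 2. Score each candidate kernel from the context, then normalize.
    scores = [w * context + b for (w, b) in gate_weights]
    attention = softmax(scores, temperature)

    # 3. Mix the K kernels into one input-dependent kernel.
    width = len(kernels[0])
    mixed = [
        sum(a * kernel[i] for a, kernel in zip(attention, kernels))
        for i in range(width)
    ]

    # 4. Apply the mixed kernel as a "valid" cross-correlation.
    output = [
        sum(mixed[j] * signal[t + j] for j in range(width))
        for t in range(len(signal) - width + 1)
    ]
    return output, attention


if __name__ == "__main__":
    kernels = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    gate = [(1.0, 0.0), (-1.0, 0.0)]  # assumed gate parameters
    out, attn = dynamic_conv1d([1.0, 2.0, 3.0, 4.0], kernels, gate)
    print(out, attn)
```

Because the kernels are mixed before the convolution runs, the cost at inference stays close to that of a single static convolution plus a small gating computation, which is the usual efficiency argument for this kind of block.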
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Absolute Position Encodings · Adam · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection
