Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models

Florian Schmid; Khaled Koutini; Gerhard Widmer

arXiv:2310.15648 · cs.SD · October 25, 2023 · 1 citation

TL;DR

This paper introduces dynamic CNN blocks, built from dynamic non-linearities, dynamic convolutions, and attention mechanisms, that increase the capacity of efficient CNNs for audio tagging and let them rival or surpass Transformer models on large-scale datasets.

Contribution

The paper presents a dynamic CNN architecture whose blocks combine dynamic non-linearities, dynamic convolutions, and attention mechanisms. The resulting models outperform traditional CNNs and match Transformers in performance.

Findings

01. Dynamic CNNs outperform traditional CNNs in audio tagging.

02. Dynamic CNNs achieve comparable or better performance than Transformers.

03. The proposed models scale well to large datasets and downstream tasks.

Abstract

The introduction of large-scale audio datasets, such as AudioSet, paved the way for Transformers to conquer the audio domain and replace CNNs as the state-of-the-art neural network architecture for many tasks. Audio Spectrogram Transformers are excellent at exploiting large datasets, creating powerful pre-trained models that surpass CNNs when fine-tuned on downstream tasks. However, current popular Audio Spectrogram Transformers are demanding in terms of computational complexity compared to CNNs. Recently, we have shown that, by employing Transformer-to-CNN Knowledge Distillation, efficient CNNs can catch up with and even outperform Transformers on large datasets. In this work, we extend this line of research and increase the capacity of efficient CNNs by introducing dynamic CNN blocks, constructed of dynamic non-linearities, dynamic convolutions and attention mechanisms. We show that…
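
The abstract names dynamic convolutions as one component of the dynamic CNN blocks but does not describe them. As a rough illustration only, the sketch below follows a common formulation of dynamic convolution: several candidate kernels are mixed with input-dependent softmax weights, and the mixed kernel is then applied as usual. It is not the authors' implementation. The function names (`softmax`, `dynamic_conv1d`), the 1D setting, the scalar mean-pooled context, and the per-kernel linear scoring parameters (`attn_weights`, `attn_bias`) are all assumptions made for this example.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_conv1d(signal, kernels, attn_weights, attn_bias, temperature=1.0):
    """Toy dynamic convolution on a 1D signal (hypothetical helper).

    signal       -- list of floats (one channel)
    kernels      -- K candidate kernels, each a list of the same length
    attn_weights -- K scalars of an assumed linear scoring layer
    attn_bias    -- K scalars of the same assumed layer
    Returns (output, alphas), where alphas are the kernel mixing weights.
    """
    # 1) Summarize the input with global average pooling.
    context = sum(signal) / len(signal)

    # 2) Score each candidate kernel from the context and normalize.
    scores = [w * context + b for w, b in zip(attn_weights, attn_bias)]
    alphas = softmax(scores, temperature)

    # 3) Mix the candidate kernels into one input-dependent kernel.
    ksize = len(kernels[0])
    mixed = [sum(a * k[i] for a, k in zip(alphas, kernels)) for i in range(ksize)]

    # 4) Apply the mixed kernel as a "valid" cross-correlation,
    #    the convention used by deep learning libraries.
    output = [
        sum(mixed[j] * signal[i + j] for j in range(ksize))
        for i in range(len(signal) - ksize + 1)
    ]
    return output, alphas


if __name__ == "__main__":
    out, alphas = dynamic_conv1d(
        signal=[1.0, 2.0, 3.0, 4.0],
        kernels=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
        attn_weights=[0.5, -0.5],
        attn_bias=[0.0, 0.0],
    )
    print(out, alphas)
```

Because the mixing weights depend on the input, two different inputs are filtered by two different effective kernels, while the cost of the convolution itself stays that of a single kernel. That is the general sense in which such a layer adds capacity cheaply.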

Code & Models

Repositories

fschmid56/efficientat (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Absolute Position Encodings · Adam · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection