Attention is All You Need? Good Embeddings with Statistics are enough: Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or ....

Prateek Verma

arXiv:2110.03183·cs.SD·February 1, 2022

TL;DR

This paper demonstrates that large-scale audio understanding can be achieved using simple statistical embeddings and a Bag-of-Words approach, without relying on complex neural architectures like Transformers or CNNs.

Contribution

The paper introduces an audio classification method built from clustered embeddings and an MLP head, with no convolutional, recurrent, or attention layers, challenging the assumption that such architectures are necessary.

Findings

01 Surpasses traditional CNN architectures in accuracy

02 Approaches the performance of Transformer-based models

03 Simplifies audio understanding with statistical embeddings

Abstract

This paper presents a way of doing large-scale audio understanding without traditional state-of-the-art neural architectures. Since the introduction of deep learning for understanding audio signals in the past decade, convolutional architectures have achieved state-of-the-art results, surpassing traditional hand-crafted features. More recently, there has been a similar shift away from traditional convolutional and recurrent neural networks towards purely end-to-end Transformer architectures. In this work, we explore an approach based on a Bag-of-Words model. Our approach does not use convolutions, recurrence, attention, transformers, or other approaches such as BERT. We utilize micro- and macro-level clustered vanilla embeddings and use an MLP head for classification. We only use feed-forward encoder-decoder models to get the bottlenecks of spectral envelopes,…
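The Bag-of-Words step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the frame embeddings and a codebook of cluster centroids already exist, and it is not the paper's implementation: the function names `nearest_centroid` and `bag_of_words`, the toy 2-D data, and the choice of Euclidean distance are all hypothetical. Each frame embedding is mapped to its nearest codeword, and the clip is summarized by a normalized histogram of codeword counts, which is the kind of fixed-length vector an MLP head could classify.

```python
import math
from collections import Counter


def nearest_centroid(vec, centroids):
    """Return the index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))


def bag_of_words(frames, centroids):
    """Map each frame embedding to a codeword; return a normalized histogram."""
    if not frames:
        raise ValueError("need at least one frame embedding")
    counts = Counter(nearest_centroid(f, centroids) for f in frames)
    # Codewords that never occur get a count of zero from Counter.
    return [counts[i] / len(frames) for i in range(len(centroids))]


# Toy data (hypothetical): three 2-D codewords and four frame embeddings.
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.7, 5.2)]

histogram = bag_of_words(frames, centroids)
print(histogram)  # one entry per codeword, summing to 1
```

The histogram discards the order of the frames, which is what makes this a Bag-of-Words representation: only how often each codeword occurs in the clip is kept.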

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · 1x1 Convolution · Weight Decay · Average Pooling · Bottleneck Residual Block · Random Resized Crop · Kaiming Initialization