Attention is All You Need? Good Embeddings with Statistics are enough: Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or ....
Prateek Verma

TL;DR
This paper demonstrates that large-scale audio understanding can be achieved using simple statistical embeddings and a Bag-of-Words approach, without relying on complex neural architectures like Transformers or CNNs.
Contribution
It introduces a method for audio classification that uses clustered embeddings and an MLP head, with no convolution, recurrence, or attention, challenging the assumption that those architectures are necessary.
Findings
Surpasses traditional CNN architectures in accuracy
Approaches the performance of Transformer-based models
Simplifies audio understanding with statistical embeddings
Abstract
This paper presents a way of doing large-scale audio understanding without traditional state-of-the-art neural architectures. Ever since the introduction of deep learning for understanding audio signals in the past decade, convolutional architectures have been able to achieve state-of-the-art results, surpassing traditional hand-crafted features. In the recent past, there has been a similar shift away from traditional convolutional and recurrent neural networks towards purely end-to-end Transformer architectures. In this work, we explore an approach based on the Bag-of-Words model. Our approach does not have any convolutions, recurrence, attention, transformers or other approaches such as BERT. We utilize micro and macro level clustered vanilla embeddings, and use an MLP head for classification. We only use feed-forward encoder-decoder models to get the bottlenecks of spectral envelopes,…
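The Bag-of-Words idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes per-frame embeddings are already available as plain lists of floats, swaps in ordinary k-means for the paper's micro and macro level clustering, and leaves out the MLP head. The names `nearest`, `kmeans`, and `bag_of_words` and the toy data are hypothetical.

```python
import random
from collections import Counter


def nearest(vec, centroids):
    """Index of the centroid closest to vec (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, centroids[k])),
    )


def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means; an assumed stand-in for the paper's clustering step."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        for j, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def bag_of_words(frames, centroids):
    """Normalized histogram of cluster ids ("words") over one clip's frames."""
    counts = Counter(nearest(f, centroids) for f in frames)
    return [counts.get(k, 0) / len(frames) for k in range(len(centroids))]


# Toy stand-ins for per-frame bottleneck embeddings (hypothetical data).
training_frames = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
codebook = kmeans(training_frames, k=2)

clip = [[0.05, 0.05], [0.95, 0.95], [1.0, 1.0], [0.9, 0.9]]
histogram = bag_of_words(clip, codebook)
print(histogram)  # one fixed-length vector per clip, regardless of clip length
```

The histogram discards frame order, which is what makes the representation a bag of words. In a pipeline like the one the abstract describes, a vector of this kind would be the input to the MLP classifier.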
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Linear Layer · 1x1 Convolution · Weight Decay · Average Pooling · Bottleneck Residual Block · Random Resized Crop · Kaiming Initialization
