Learning neural audio features without supervision
Sarthak Yadav, Neil Zeghidour

TL;DR
This paper investigates combining learnable frontends with self-supervised pretraining for audio classification. It shows that pretrained learnable frontends can outperform fixed mel-filterbanks, and that randomly initialized filters outperform mel-scale initialization under self-supervision.
Contribution
The paper demonstrates the benefits of pretraining learnable frontends jointly with the main classification architecture, and identifies key differences in learned filter properties between supervised and self-supervised training.
Findings
Pretraining learnable frontends (SincNet and LEAF) on Audioset improves linear-probe performance over fixed mel-filterbanks.
Randomly initialized filters outperform mel-scale initialization in self-supervised learning.
Self-supervised filters diverge from mel-scale to capture broader frequency ranges.
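To make the contrast between mel-scale and random filter initialization concrete, here is a minimal NumPy sketch of how the center frequencies of a learnable filterbank might be initialized under each scheme. It is an illustration only, not the authors' code: the function names, the 40-filter count, the 60 to 7800 Hz range, the uniform sampling used for the random case, and the HTK-style mel formula are all assumptions chosen for this example.

```python
import numpy as np

def hz_to_mel(f_hz):
    # HTK-style mel formula (an assumption; other mel variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_init_center_freqs(n_filters=40, f_min=60.0, f_max=7800.0):
    # Center frequencies equally spaced on the mel scale: dense at low
    # frequencies, progressively sparser at high frequencies.
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters)
    return mel_to_hz(mels)

def random_init_center_freqs(n_filters=40, f_min=60.0, f_max=7800.0, seed=0):
    # Hypothetical random scheme: center frequencies drawn uniformly in Hz,
    # then sorted. The paper's exact random initialization may differ.
    rng = np.random.default_rng(seed)
    return np.sort(rng.uniform(f_min, f_max, size=n_filters))

mel_freqs = mel_init_center_freqs()
rand_freqs = random_init_center_freqs()
```

In a learnable frontend these values would only be starting points: the center frequencies are trainable parameters, so training can move them away from whichever layout they start in.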
Abstract
Deep audio classification, traditionally cast as training a deep neural network on top of mel-filterbanks in a supervised fashion, has recently benefited from two independent lines of work. The first one explores "learnable frontends", i.e., neural modules that produce a learnable time-frequency representation, to overcome limitations of fixed features. The second one uses self-supervised learning to leverage unprecedented scales of pre-training data. In this work, we study the feasibility of combining both approaches, i.e., pre-training a learnable frontend jointly with the main architecture for downstream classification. First, we show that pre-training two previously proposed frontends (SincNet and LEAF) on Audioset drastically improves linear-probe performance over fixed mel-filterbanks, suggesting that learnable time-frequency representations can benefit self-supervised pre-training…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Acoustic Wave Phenomena Research
