Learning neural audio features without supervision

Sarthak Yadav; Neil Zeghidour

arXiv:2203.15519·cs.SD·March 30, 2022

TL;DR

This paper investigates combining learnable frontends with self-supervised pretraining for audio classification. It shows that learnable time-frequency representations can outperform fixed mel-filterbanks, and that randomly initialized filters outperform mel-scale initialization in the self-supervised setting.

Contribution

It demonstrates the benefits of jointly pretraining learnable frontends with classification models and uncovers key differences in filter properties between supervised and self-supervised learning.

Findings

01. Pretraining learnable frontends improves performance over fixed features.

02. Randomly initialized filters outperform mel-scale initialization in self-supervised learning.

03. Self-supervised filters diverge from mel-scale to capture broader frequency ranges.
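
To make the two initialization schemes in the findings concrete, here is a minimal sketch of how filter center frequencies might be placed on the mel scale versus drawn at random. It is an illustration only, not the authors' code: the function names, the 40-filter count, the 60 to 7800 Hz range, and the choice of a uniform distribution in Hz for the random case are all assumptions.

```python
import numpy as np

def hz_to_mel(f_hz):
    # Common (HTK-style) Hz-to-mel conversion.
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_init(n_filters, f_min, f_max):
    # Center frequencies evenly spaced on the mel scale:
    # dense at low frequencies, sparse at high frequencies.
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters)
    return mel_to_hz(mels)

def random_init(n_filters, f_min, f_max, seed=0):
    # Assumption: "random" means uniform in Hz; the paper may use another scheme.
    rng = np.random.default_rng(seed)
    return np.sort(rng.uniform(f_min, f_max, size=n_filters))

# Hypothetical settings, chosen only for illustration.
mel_centers = mel_init(40, 60.0, 7800.0)
rand_centers = random_init(40, 60.0, 7800.0)
```

In a learnable frontend, either array would serve only as the starting point; the center frequencies are then updated by gradient descent along with the rest of the model.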

Abstract

Deep audio classification, traditionally cast as training a deep neural network on top of mel-filterbanks in a supervised fashion, has recently benefited from two independent lines of work. The first one explores "learnable frontends", i.e., neural modules that produce a learnable time-frequency representation, to overcome limitations of fixed features. The second one uses self-supervised learning to leverage unprecedented scales of pre-training data. In this work, we study the feasibility of combining both approaches, i.e., pre-training a learnable frontend jointly with the main architecture for downstream classification. First, we show that pre-training two previously proposed frontends (SincNet and LEAF) on AudioSet drastically improves linear-probe performance over fixed mel-filterbanks, suggesting that learnable time-frequency representations can benefit self-supervised pre-training…
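
As a rough illustration of what a "learnable frontend" filter can look like, the sketch below builds a single SincNet-style band-pass filter from two scalars, a low cutoff and a bandwidth. In a trained frontend those two scalars would be the learnable parameters; here they are fixed numbers. This is a sketch under stated assumptions, not the paper's implementation: the function names, the kernel size of 251, the 16 kHz sample rate, and the Hamming window are illustrative choices.

```python
import numpy as np

def sinc_bandpass(f_low_hz, bandwidth_hz, kernel_size=251, sample_rate=16000):
    # Normalized cutoffs in cycles per sample.
    f1 = f_low_hz / sample_rate
    f2 = (f_low_hz + bandwidth_hz) / sample_rate
    # Symmetric time axis centered on zero.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Band-pass = difference of two ideal low-pass filters.
    # np.sinc(x) is the normalized sinc, sin(pi x) / (pi x).
    h = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # A window smooths the truncation of the ideal (infinite) response.
    return h * np.hamming(kernel_size)

def gain_at(h, freq_hz, sample_rate=16000, n_fft=4096):
    # Magnitude response at the FFT bin closest to freq_hz.
    mag = np.abs(np.fft.rfft(h, n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    return mag[np.argmin(np.abs(freqs - freq_hz))]

# Hypothetical filter passing roughly 1000 to 1500 Hz.
h = sinc_bandpass(1000.0, 500.0)
print(gain_at(h, 1250.0), gain_at(h, 300.0), gain_at(h, 3000.0))
```

The appeal of this parameterization is that each filter is described by two interpretable numbers rather than a full kernel, which is what makes it possible to inspect how learned filters drift away from a mel-scale layout.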

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Acoustic Wave Phenomena Research