Self-supervised audio representation learning for mobile devices

Marco Tagliasacchi; Beat Gfeller; Félix de Chaumont Quitry; Dominik Roblek

arXiv:1905.11796·eess.AS·May 29, 2019·27 cites

Open Access

TL;DR

This paper presents self-supervised methods for learning audio representations with models small enough to run on mobile devices. The methods exploit temporal context, in part through objectives inspired by Word2Vec, and achieve competitive performance on downstream tasks.

Contribution

The paper introduces novel self-supervised learning methods for audio on mobile devices, focusing on temporal context exploitation and small encoder architectures.

Findings

01. The learned embeddings are effective across various downstream tasks.

02. Some self-supervised models approach the performance of supervised ones.

03. The methods are suitable for privacy-preserving federated learning.

Abstract

We explore self-supervised models that can be potentially deployed on mobile devices to learn general purpose audio representations. Specifically, we propose methods that exploit the temporal context in the spectrogram domain. One method estimates the temporal gap between two short audio segments extracted at random from the same audio clip. The other methods are inspired by Word2Vec, a popular technique used to learn word embeddings, and aim at reconstructing a temporal spectrogram slice from past and future slices or, alternatively, at reconstructing the context of surrounding slices from the current slice. We focus our evaluation on small encoder architectures, which can be potentially run on mobile devices during both inference (re-using a common learned representation across multiple downstream tasks) and training (capturing the true data distribution without compromising users'…
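The sketch below illustrates how training examples for the three objectives described in the abstract could be drawn from one clip. It is a minimal sketch under stated assumptions, not the authors' implementation: the spectrogram is a plain list of frames, the function names `sample_temporal_gap_pair` and `slice_with_context` are hypothetical, and the gap label (absolute distance between segment starts, scaled to [0, 1]) is an assumed normalization. Encoders, losses, and feature extraction are omitted.

```python
import random


def sample_temporal_gap_pair(frames, segment_len, rng):
    """Pick two segments at random from one clip and return them with a gap label."""
    max_start = len(frames) - segment_len
    if max_start < 1:
        raise ValueError("clip is too short for two distinct segment positions")
    start_a = rng.randint(0, max_start)  # randint includes both end points
    start_b = rng.randint(0, max_start)
    # Assumed label: absolute distance between segment starts, scaled to [0, 1].
    gap = abs(start_a - start_b) / max_start
    seg_a = frames[start_a:start_a + segment_len]
    seg_b = frames[start_b:start_b + segment_len]
    return seg_a, seg_b, gap


def slice_with_context(frames, center_start, slice_len, num_context):
    """Return (past slices, centre slice, future slices) around center_start."""
    first = center_start - num_context * slice_len
    last = center_start + (num_context + 1) * slice_len
    if first < 0 or last > len(frames):
        raise ValueError("not enough frames around the centre slice")
    slices = [frames[s:s + slice_len] for s in range(first, last, slice_len)]
    return slices[:num_context], slices[num_context], slices[num_context + 1:]


# Toy "spectrogram": 100 frames of 4 bins, each frame filled with its time index.
frames = [[float(t)] * 4 for t in range(100)]
rng = random.Random(0)

# Temporal-gap task: the model would regress `gap` from the two segments.
seg_a, seg_b, gap = sample_temporal_gap_pair(frames, 10, rng)

# Word2Vec-style tasks on the same slices, with input and target swapped.
past, center, future = slice_with_context(frames, 40, 10, 2)
cbow_input, cbow_target = past + future, center          # context -> current slice
skipgram_input, skipgram_target = center, past + future  # current slice -> context
```

The two Word2Vec-inspired variants use the same slicing and differ only in which side is the model input and which is the reconstruction target.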

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis