TL;DR
This paper introduces BYOL-A, a self-supervised learning method for general-purpose audio representation that learns from single audio segments without relying on segment relationships, achieving state-of-the-art results.
Contribution
It presents a novel BYOL-based approach for audio that does not depend on segment relationships, expanding self-supervised learning to broader audio applications.
Findings
Achieves state-of-the-art results in various audio tasks.
Effective in learning from a single audio segment without segment relationships.
Ablation studies clarify how much each component of the method contributes.
Abstract
Inspired by the recent progress in self-supervised learning for computer vision that generates supervision using data augmentations, we explore a new general-purpose audio representation learning approach. We propose learning general-purpose audio representation from a single audio segment without expecting relationships between different time segments of audio samples. To implement this principle, we introduce Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced "viola"), an audio self-supervised learning method based on BYOL for learning general-purpose audio representation. Unlike most previous audio self-supervised learning methods that rely on agreement of vicinity audio segments or disagreement of remote ones, BYOL-A creates contrasts in an augmented audio segment pair derived from a single audio segment. With a combination of normalization and augmentation techniques,…
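The single-segment idea in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the segment is treated as a flattened log-mel vector, the only augmentation shown is a mixup-style blend with a background segment (Mixup is listed under this paper's methods, but the exact formula and the `alpha` range here are assumptions), and the normalization step is a simple placeholder. The loss and the moving-average target update follow the standard BYOL formulation. All function names are hypothetical.

```python
import math
import random


def mixup_log_domain(x, background, lam):
    """Blend a background segment into x. Inputs are assumed to be log-scaled,
    so the mix happens in the linear domain and is mapped back to log."""
    return [math.log((1.0 - lam) * math.exp(a) + lam * math.exp(b))
            for a, b in zip(x, background)]


def normalize(x, eps=1e-8):
    """Placeholder normalization (assumption): zero mean, unit variance."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((a - mean) ** 2 for a in x) / len(x))
    return [(a - mean) / (std + eps) for a in x]


def make_views(segment, background, rng, alpha=0.4):
    """Create two differently augmented views of the SAME segment.
    No other time segment of the recording is used as a positive or negative."""
    views = []
    for _ in range(2):
        lam = rng.uniform(0.0, alpha)  # mixing ratio; alpha is an assumed value
        views.append(normalize(mixup_log_domain(segment, background, lam)))
    return views[0], views[1]


def byol_loss(online_prediction, target_projection):
    """Standard BYOL loss: MSE between L2-normalized vectors,
    which equals 2 - 2 * cosine similarity."""
    dot = sum(q * z for q, z in zip(online_prediction, target_projection))
    norm_q = math.sqrt(sum(q * q for q in online_prediction))
    norm_z = math.sqrt(sum(z * z for z in target_projection))
    return 2.0 - 2.0 * dot / (norm_q * norm_z)


def ema_update(target_params, online_params, tau=0.99):
    """Target network weights track the online weights by exponential moving
    average; the target branch receives no gradient in BYOL."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]
```

In a full training loop, each view would pass through the online and target encoders, `byol_loss` would be computed symmetrically with the views swapped, and `ema_update` would run after every optimizer step.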
Taxonomy
Methods: Mixup · Bootstrap Your Own Latent
