Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks

Sangeeta Srivastava; Yun Wang; Andros Tjandra; Anurag Kumar; Chunxi Liu; Kritika Singh; Yatharth Saraf

arXiv:2110.07313 · cs.SD · January 10, 2022

TL;DR

This paper introduces a conformer-based self-supervised learning approach for non-speech audio tasks that cuts the need for labeled data by two-thirds and reaches 0.415 mAP on AudioSet, a new state-of-the-art among audio-only self-supervised methods.

Contribution

The paper combines the wav2vec 2.0 framework with parameter-efficient conformer architectures for self-supervised learning on non-speech audio, an area that few works have analyzed comprehensively.

Findings

01. Achieves a mean average precision (mAP) of 0.415 on AudioSet, a new state-of-the-art for audio-only self-supervised learning.

02. Reduces the need for labeled data by two-thirds.

03. Matches or surpasses supervised pre-training performance on multiple tasks.

Abstract

Representation learning from unlabeled data has been of major interest in artificial intelligence research. While self-supervised speech representation learning has been popular in the speech research community, very few works have comprehensively analyzed audio representation learning for non-speech audio tasks. In this paper, we propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks. We combine the well-known wav2vec 2.0 framework, which has shown success in self-supervised learning for speech tasks, with parameter-efficient conformer architectures. Our self-supervised pre-training can reduce the need for labeled data by two-thirds. On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset through audio-only self-supervised learning. Our…
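
The headline AudioSet number is a mean average precision (mAP). As a minimal sketch of how such a score is commonly computed for multi-label audio tagging, the example below averages per-class average precision across classes (macro averaging). This is an assumption about the convention, not the paper's evaluation code; the function names and the toy data are hypothetical.

```python
from typing import List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision for one class: mean precision at the rank of each positive clip."""
    # Rank clips from highest to lowest predicted score (ties keep input order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: a class with no positive clips contributes 0.0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    label_matrix: Sequence[Sequence[int]],
    score_matrix: Sequence[Sequence[float]],
) -> float:
    """Macro-averaged AP; rows are clips, columns are classes."""
    num_classes = len(label_matrix[0])
    aps = []
    for c in range(num_classes):
        col_labels = [row[c] for row in label_matrix]
        col_scores = [row[c] for row in score_matrix]
        aps.append(average_precision(col_labels, col_scores))
    return sum(aps) / num_classes


# Toy example: 4 clips, 2 classes (hypothetical data, not from the paper).
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
toy_map = mean_average_precision(toy_labels, toy_scores)
```

In the toy data, the first class has positives at ranks 1 and 3 and the second class has positives at ranks 1 and 2, so the per-class values differ and the macro average weights both classes equally regardless of how many positive clips each has.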

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis