Non-Contrastive Self-supervised Learning for Utterance-Level Information Extraction from Speech

Jaejin Cho; Jesús Villalba; Laureano Moro-Velazquez; Najim Dehak

arXiv:2208.05445 · eess.AS · August 11, 2022

TL;DR

This paper introduces a non-contrastive self-supervised learning approach, adapted from computer vision's DINO, for utterance-level speech representation, outperforming traditional supervised methods in various speech tasks.

Contribution

The paper applies a non-contrastive SSL method (DINO) to speech and reports better transfer learning performance than supervised x-vector models across multiple speech applications.

Findings

01. DINO outperforms x-vector in downstream tasks

02. Fine-tuning strategies influence transfer performance

03. Augmentation benefits speech emotion recognition

Abstract

In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning (SSL) of utterance-level speech representation can be used in speech applications that require discriminative representation of consistent attributes within an utterance: speaker, language, emotion, and age. Existing frame-level self-supervised speech representation, e.g., wav2vec, can be used as utterance-level representation with pooling, but the models are usually large. There are also SSL techniques to learn utterance-level representation. One of the most successful is a contrastive method, which requires negative sampling: selecting alternative samples to contrast with the current sample (anchor). However, this does not ensure that all the negative samples belong to classes different from the anchor class without…
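
The abstract contrasts contrastive SSL, which needs negative samples, with the non-contrastive DINO approach named in the TL;DR. As a rough illustration only, the sketch below shows the general shape of a DINO-style objective for a single pair of views: a cross-entropy between a centered, sharpened teacher distribution and a student distribution, with no negatives involved. It is not the authors' implementation. The function names, the plain-list representation of logits, and the temperature values are assumptions chosen for illustration and are not taken from this paper.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_total for x in scaled]

def dino_style_loss(student_logits, teacher_logits, center,
                    student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy H(teacher, student) for one pair of views.

    The teacher output is centered (subtract a running mean) and
    sharpened (low temperature). Temperatures here are illustrative
    assumptions, not values reported in the paper.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))

def ema_update(old, new, momentum):
    """Exponential moving average, e.g. for the running center."""
    return [momentum * o + (1.0 - momentum) * n for o, n in zip(old, new)]

# Hypothetical 3-dimensional head outputs for two views of one utterance.
teacher_out = [2.0, 0.0, -1.0]
center = [0.0, 0.0, 0.0]
loss_agree = dino_style_loss([2.0, 0.0, -1.0], teacher_out, center)
loss_disagree = dino_style_loss([-1.0, 0.0, 2.0], teacher_out, center)
```

In this sketch the loss is small when the student agrees with the teacher on the same utterance and large when it does not, and no sample from another utterance is ever used as a negative. In a full training setup the teacher would typically not receive gradients, and its weights would track the student through a moving average similar to `ema_update`.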

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Dense Connections · Residual Connection · Vision Transformer