Non-Contrastive Self-supervised Learning for Utterance-Level Information Extraction from Speech
Jaejin Cho, Jesús Villalba, Laureano Moro-Velazquez, Najim Dehak

TL;DR
This paper adapts DINO, a non-contrastive self-supervised learning method from computer vision, to learn utterance-level speech representations, and reports better transfer performance than supervised x-vector pre-training across several speech tasks.
Contribution
It applies a non-contrastive SSL method to speech and demonstrates improved transfer learning performance over supervised x-vector models across multiple speech applications.
Findings
DINO outperforms x-vector in downstream tasks
Fine-tuning strategies influence transfer performance
Augmentation benefits speech emotion recognition
Abstract
In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning (SSL) of utterance-level speech representation can be used in speech applications that require discriminative representation of consistent attributes within an utterance: speaker, language, emotion, and age. Existing frame-level self-supervised speech representation, e.g., wav2vec, can be used as utterance-level representation with pooling, but the models are usually large. There are also SSL techniques to learn utterance-level representation. One of the most successful is a contrastive method, which requires negative sampling: selecting alternative samples to contrast with the current sample (anchor). However, this does not ensure that all the negative samples belong to classes different from the anchor class without…
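The negative-sampling concern raised in the abstract can be made concrete with a small sketch. It is an illustration only and not the authors' code: the function name `false_negative_probability`, the toy batch, and the speaker labels are all hypothetical, and the labels are used purely to inspect the sampling (a self-supervised method would not have access to them). The sketch assumes negatives are drawn uniformly from the other utterances in a batch and computes the chance that a single draw shares the anchor's class.

```python
from collections import Counter


def false_negative_probability(labels, anchor_idx):
    """Chance that one uniformly drawn negative shares the anchor's class.

    `labels` are hypothetical ground-truth classes, used here only to
    inspect the sampling; an SSL method would not see them.
    """
    if len(labels) < 2:
        raise ValueError("need at least one candidate besides the anchor")
    counts = Counter(labels)
    # Utterances with the anchor's class, excluding the anchor itself.
    same_class_others = counts[labels[anchor_idx]] - 1
    return same_class_others / (len(labels) - 1)


# Toy batch of six utterances from three hypothetical speakers.
batch_labels = ["spk1", "spk1", "spk1", "spk2", "spk2", "spk3"]
p = false_negative_probability(batch_labels, anchor_idx=0)
print(f"P(false negative) for anchor 0: {p:.2f}")
```

In this toy batch, two of the five candidates for anchor 0 come from the same speaker, so a uniformly sampled "negative" is actually a positive 40% of the time. A non-contrastive objective avoids this issue by not drawing negatives at all.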
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Dense Connections · Residual Connection · Vision Transformer
