DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

Alexander H. Liu; Heng-Jui Chang; Michael Auli; Wei-Ning Hsu; James R. Glass

arXiv:2305.10005 · cs.CL · January 17, 2024 · 6 cites

TL;DR

DinoSR is a self-supervised speech representation learning method that combines masked language modeling, self-distillation, and online clustering, and it surpasses previous state-of-the-art performance in several downstream speech tasks.

Contribution

The paper presents a framework that integrates masked language modeling, self-distillation, and online clustering for speech representation learning.

Findings

01 Surpasses previous state-of-the-art performance in several downstream tasks

02 Learns discrete, phone-like units from speech through online clustering

03 Provides a detailed analysis of the model and the learned discrete units

Abstract

In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR) which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units.
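The three steps in the abstract (teacher embeddings, online clustering, discrete targets for the student) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the nearest-codeword assignment, the moving-average codebook update with a `decay` parameter, and the cross-entropy student loss are all assumptions chosen for illustration, and random arrays stand in for the teacher and student networks.

```python
import numpy as np

def assign_codes(embeddings, codebook):
    """Return the index of the nearest codeword for each embedding."""
    # Squared Euclidean distance between every frame and every codeword.
    dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def update_codebook(codebook, embeddings, codes, decay=0.9):
    """Move each used codeword toward the mean of its assigned embeddings.

    The moving-average rule is an assumed form of "online clustering".
    """
    new_codebook = codebook.copy()
    for k in np.unique(codes):
        cluster_mean = embeddings[codes == k].mean(axis=0)
        new_codebook[k] = decay * codebook[k] + (1.0 - decay) * cluster_mean
    return new_codebook

def student_loss(student_logits, codes):
    """Cross-entropy between student predictions and teacher-derived codes."""
    shifted = student_logits - student_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(codes)), codes].mean()

rng = np.random.default_rng(0)
teacher_embeddings = rng.normal(size=(50, 8))  # stand-in for teacher outputs
codebook = rng.normal(size=(4, 8))             # 4 hypothetical discrete units

# Step 1 and 2: discretize teacher embeddings, then refresh the codebook.
codes = assign_codes(teacher_embeddings, codebook)
codebook = update_codebook(codebook, teacher_embeddings, codes)

# Step 3: the student is trained to predict the discrete codes.
student_logits = rng.normal(size=(50, 4))      # stand-in for student outputs
loss = student_loss(student_logits, codes)
```

In this sketch the codebook is updated from teacher outputs only, and the student receives nothing but the discrete indices, which mirrors the "discretized tokens guide a student network" description at a toy scale.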

Code & Models

Repositories

alexander-h-liu/dinosr (PyTorch, official)

Videos

DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning · SlidesLive

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and dialogue systems