CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning

Chutong Meng; Junyi Ao; Tom Ko; Mingxuan Wang; Haizhou Li

arXiv:2210.04062·cs.SD·July 6, 2023

TL;DR

CoBERT is a self-supervised speech representation learning method that converts speech into discrete codes and trains a model to predict code representations from a masked view of the speech input. It outperforms recent state-of-the-art results on ASR and brings significant improvements on the SUPERB speech translation task.

Contribution

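The paper presents a code-based self-supervised learning approach for speech. Unlike prior self-distillation approaches, in which teacher and student share a modality, the model predicts representations from a different modality (discrete codes) than its masked speech input, improving speech recognition and translation performance.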

Findings

01 Outperforms recent state-of-the-art on the ASR task

02 Achieves significant improvements on the SUPERB speech translation task

03 Demonstrates the effectiveness of code-based self-supervised learning for speech

Abstract

Speech is the surface form of a finite set of phonetic units, which can be represented by discrete codes. We propose the Code BERT (CoBERT) approach for self-supervised speech representation learning. The idea is to convert an utterance to a sequence of discrete codes, and perform code representation learning, where we predict the code representations based on a masked view of the original speech input. Unlike the prior self-distillation approaches of which the teacher and the student are of the same modality, our target model predicts representations from a different modality. CoBERT outperforms the most recent state-of-the-art performance on the ASR task and brings significant improvements on the SUPERB speech translation (ST) task. Our code and models are released at https://github.com/mct10/CoBERT.
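
The pipeline described in the abstract (utterance to discrete codes, then predicting code representations from a masked view of the speech) can be illustrated with a toy sketch. This is not the authors' implementation: the nearest-centroid quantizer, the lookup table standing in for a code-representation teacher, the span-masking helper, and the mean-squared-error loss are all assumptions chosen for brevity, and every function name here is hypothetical. The released repository is the reference for the actual method.

```python
import random


def assign_codes(frames, centroids):
    """Map each frame to the index of its nearest centroid (its discrete code).

    Assumption: a fixed codebook and squared Euclidean distance.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sqdist(frame, centroids[k]))
        for frame in frames
    ]


def mask_spans(num_frames, span, num_spans, rng):
    """Pick random contiguous spans and return the sorted masked frame indices."""
    masked = set()
    for _ in range(num_spans):
        start = rng.randrange(max(1, num_frames - span + 1))
        masked.update(range(start, min(start + span, num_frames)))
    return sorted(masked)


def masked_regression_loss(student_out, targets, masked):
    """Mean squared error between student outputs and targets, masked frames only.

    Assumption: MSE is a stand-in for whatever regression loss is actually used.
    """
    total, count = 0.0, 0
    for t in masked:
        for s, y in zip(student_out[t], targets[t]):
            total += (s - y) ** 2
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Four 2-dimensional "speech frames" (made-up numbers).
    frames = [[0.1, 0.0], [0.9, 1.1], [0.0, 0.2], [1.0, 0.9]]
    centroids = [[0.0, 0.0], [1.0, 1.0]]    # hypothetical codebook
    code_table = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical teacher: one vector per code

    codes = assign_codes(frames, centroids)
    targets = [code_table[c] for c in codes]  # code representations to be predicted
    masked = mask_spans(len(frames), span=2, num_spans=1, rng=random.Random(0))

    # Placeholder for the output of a speech encoder that saw the masked input.
    student_out = [[0.0, 0.0] for _ in frames]
    print(codes, masked, masked_regression_loss(student_out, targets, masked))
```

In this sketch the targets come from the code sequence while the student output would come from the masked speech, which is the cross-modality aspect the abstract contrasts with same-modality self-distillation. A realistic teacher would produce contextual representations of the codes rather than a per-code lookup.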

Code & Models

Repositories

mct10/cobert (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems