Self-supervised Learning with Random-projection Quantizer for Speech Recognition

Chung-Cheng Chiu; James Qin; Yu Zhang; Jiahui Yu; Yonghui Wu

arXiv:2202.01855·cs.CL·July 1, 2022·29 cites

TL;DR

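This paper introduces a simple self-supervised learning method for speech recognition that uses a fixed, untrained random-projection quantizer to generate the discrete labels the model predicts for masked speech. On LibriSpeech it reports word error rates similar to previous self-supervised work with non-streaming models, and lower word error rates and latency than wav2vec 2.0 and w2v-BERT with streaming models, along with improved multilingual results.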

Contribution

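The paper proposes a self-supervised learning approach whose quantizer, a randomly initialized projection matrix paired with a randomly initialized codebook, is never trained. Because the quantizer is separate from the speech recognition model, the approach places no constraints on the recognition architecture.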

Findings

01

Achieves WER on LibriSpeech similar to previous self-supervised work with non-streaming models

02

Provides lower WER and latency than wav2vec 2.0 and w2v-BERT in streaming mode

03

Significantly improves multilingual speech recognition results over wav2vec 2.0 and w2v-BERT

Abstract

We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict the masked speech signals, in the form of discrete labels generated with a random-projection quantizer. In particular the quantizer projects speech inputs with a randomly initialized matrix, and does a nearest-neighbor lookup in a randomly-initialized codebook. Neither the matrix nor the codebook is updated during self-supervised learning. Since the random-projection quantizer is not trained and is separated from the speech recognition model, the design makes the approach flexible and is compatible with universal speech recognition architecture. On LibriSpeech our approach achieves similar word-error-rates as previous work using self-supervised learning with non-streaming models, and provides lower word-error-rates and latency than wav2vec 2.0 and w2v-BERT…
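The quantization step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the dimensions, the standard-normal initialization, and the use of squared Euclidean distance for the nearest-neighbor lookup are all assumptions made for the example, since the abstract does not specify them.

```python
import numpy as np


def make_random_quantizer(input_dim, code_dim, codebook_size, seed=0):
    """Create a frozen projection matrix and codebook (hypothetical helper).

    Both arrays are drawn once from a standard normal distribution (an
    assumption for this sketch) and are never updated afterwards.
    """
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((input_dim, code_dim))
    codebook = rng.standard_normal((codebook_size, code_dim))
    return projection, codebook


def quantize(features, projection, codebook):
    """Map each frame to the index of its nearest codebook entry.

    features:   (num_frames, input_dim) speech features
    projection: (input_dim, code_dim) fixed random matrix
    codebook:   (codebook_size, code_dim) fixed random codebook
    returns:    (num_frames,) integer labels
    """
    projected = features @ projection  # (num_frames, code_dim)
    # Squared Euclidean distance from every frame to every codebook entry.
    diffs = projected[:, None, :] - codebook[None, :, :]
    distances = (diffs ** 2).sum(axis=-1)  # (num_frames, codebook_size)
    return distances.argmin(axis=1)


# Example with made-up sizes: 50 frames of 80-dimensional features.
projection, codebook = make_random_quantizer(
    input_dim=80, code_dim=16, codebook_size=32, seed=0
)
frames = np.random.default_rng(1).standard_normal((50, 80))
labels = quantize(frames, projection, codebook)
```

Here `labels` holds one integer per frame. During pre-training, the labels at masked positions would serve as prediction targets for the speech model, while `projection` and `codebook` stay fixed throughout.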

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing