Unsupervised Data Selection via Discrete Speech Representation for ASR

Zhiyun Lu; Yongqiang Wang; Yu Zhang; Wei Han; Zhehuai Chen; Parisa Haghani

arXiv:2204.01981 · eess.AS · April 6, 2022 · Interspeech · 1 citation

TL;DR

This paper introduces an unsupervised data selection method that uses discrete speech representations to choose pre-training data for self-supervised ASR. Pre-training on the selected subset needs far less data and lowers word error rate on LibriSpeech and on multilingual test sets.

Contribution

It proposes a contrastive data selection technique that operates on discrete speech tokens and picks, from a general data pool, the speech that is acoustically similar to a target domain.

Findings

01  Reduces pre-training data by 94% while improving ASR performance.

02  Achieves 11.8% relative WER reduction on LibriSpeech test-other.

03  Attains over 15% relative WER reduction on multilingual test sets.
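
The percentages above are relative quantities, so a short arithmetic sketch may help when reading them. The function name `relative_reduction` and both WER values below are hypothetical and chosen only to illustrate the formula; they are not numbers reported in the paper.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Relative reduction of `new` with respect to `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


# Removing 94% of the pre-training pool leaves 6% of it.
kept_fraction = 1.0 - 0.94

# Hypothetical WERs (in percent), NOT taken from the paper:
# a baseline of 10.0 and a new value of 8.82 differ by 11.8% relative.
hypothetical_baseline_wer = 10.0
hypothetical_selected_wer = 8.82
rel = relative_reduction(hypothetical_baseline_wer, hypothetical_selected_wer)

print(f"kept fraction of pool: {kept_fraction:.2f}")
print(f"relative WER reduction: {rel:.1%}")
```

A relative reduction is measured against the baseline error rate, so an 11.8% relative reduction is a much smaller change in absolute WER points.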

Abstract

Self-supervised learning of speech representations has achieved impressive results in improving automatic speech recognition (ASR). In this paper, we show that data selection is important for self-supervised learning. We propose a simple and effective unsupervised data selection method which selects acoustically similar speech to a target domain. It takes the discrete speech representation available in common self-supervised learning frameworks as input, and applies a contrastive data selection method on the discrete tokens. Through extensive empirical studies we show that our proposed method reduces the amount of required pre-training data and improves the downstream ASR performance. Pre-training on a selected subset of 6% of the general data pool results in 11.8% relative improvements in LibriSpeech test-other compared to pre-training on the full set. On Multilingual LibriSpeech…
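
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes that each utterance has already been converted to a sequence of discrete token IDs, and it scores utterances with the log-likelihood difference between a target-domain model and a general-pool model. The add-one smoothed bigram models, the function names, and the toy data are all assumptions made for illustration; the paper's actual token models and scoring details may differ.

```python
import math
from collections import Counter


def train_bigram_lm(sequences, vocab_size):
    """Fit an add-one smoothed bigram model; return a log-probability function."""
    bigrams = Counter()
    contexts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1

    def log_prob(seq):
        total = 0.0
        for prev, cur in zip(seq, seq[1:]):
            num = bigrams[(prev, cur)] + 1
            den = contexts[prev] + vocab_size
            total += math.log(num / den)
        return total

    return log_prob


def contrastive_scores(pool, target_lm, general_lm):
    """Per-transition log-likelihood difference: target model minus general model."""
    scores = []
    for seq in pool:
        n = max(len(seq) - 1, 1)  # number of bigram transitions
        scores.append((target_lm(seq) - general_lm(seq)) / n)
    return scores


def select_top_fraction(scores, fraction):
    """Indices of the highest-scoring utterances, keeping `fraction` of the pool."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data (hypothetical): token IDs from a 4-entry codebook.
vocab_size = 4
target_data = [[0, 1, 0, 1, 0, 1]] * 3
pool = [
    [0, 1, 0, 1, 0, 1],  # resembles the target domain
    [2, 3, 2, 3, 2, 3],
    [2, 2, 3, 3, 2, 2],
    [3, 2, 3, 2, 3, 2],
]

target_lm = train_bigram_lm(target_data, vocab_size)
general_lm = train_bigram_lm(pool, vocab_size)
scores = contrastive_scores(pool, target_lm, general_lm)
selected = select_top_fraction(scores, fraction=0.25)
print(selected)
```

Dividing by the number of transitions keeps long utterances from dominating the ranking. Subtracting the general-model score favors utterances that are specifically typical of the target domain over utterances that are simply common in the pool.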

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing