Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

Felix Wu; Kwangyoun Kim; Jing Pan; Kyu Han; Kilian Q. Weinberger; Yoav Artzi

arXiv:2109.06870 · cs.CL · September 15, 2021 · 5 citations

TL;DR

This paper studies the trade-off between performance and efficiency in unsupervised pre-training for speech recognition, focusing on wav2vec 2.0. It introduces SEW (Squeezed and Efficient Wav2vec), an architecture that is both faster at inference and more accurate than wav2vec 2.0.

Contribution

The paper formalizes several architecture design choices in wav2vec 2.0 that influence both ASR performance and efficiency, and combines these observations into SEW, a pre-trained model architecture that improves both inference speed and word error rate.

Findings

01

Under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup over wav2vec 2.0, with a 13.5% relative reduction in word error rate.

02

At a similar inference time, SEW reduces word error rate by 25-50% relative to wav2vec 2.0 across different model sizes.

03

SEW improves on wav2vec 2.0 in both performance and efficiency across a variety of training setups.

Abstract

This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.
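The abstract reports two kinds of figures: an inference speedup (1.9x) and a relative reduction in word error rate (13.5%, and 25-50% at similar inference time). A minimal sketch of how such figures are typically computed follows. The function names and all input values are hypothetical and chosen only so the arithmetic is easy to follow; they are not measurements from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the fraction by which new_wer is lower than baseline_wer.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def inference_speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new model is than the baseline."""
    if new_seconds <= 0:
        raise ValueError("new_seconds must be positive")
    return baseline_seconds / new_seconds


# Hypothetical inputs, NOT results from the paper.
baseline_wer, new_wer = 10.0, 8.65    # word error rates, in percent
baseline_time, new_time = 19.0, 10.0  # inference time, in seconds

reduction = relative_wer_reduction(baseline_wer, new_wer)
speedup = inference_speedup(baseline_time, new_time)

print(f"relative WER reduction: {reduction:.1%}")
print(f"inference speedup: {speedup:.1f}x")
```

Note that a relative reduction is measured against the baseline error rate, so the same percentage corresponds to a smaller absolute gain when the baseline WER is already low.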

Code & Models

Repositories

asappresearch/sew (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling