Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition
Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi

TL;DR
This paper studies performance-efficiency trade-offs in unsupervised pre-training for automatic speech recognition and introduces SEW (Squeezed and Efficient Wav2vec), a pre-trained architecture that improves on wav2vec 2.0 in both accuracy and inference speed.
Contribution
The paper formalizes several wav2vec 2.0 architecture design choices that affect both recognition performance and efficiency, and combines the resulting observations into SEW, a pre-trained model that is both faster and more accurate.
Findings
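Under the 100h-960h semi-supervised setup on LibriSpeech, SEW runs inference 1.9x faster than wav2vec 2.0 while reducing word error rate by 13.5% (relative).
At similar inference time, SEW reduces word error rate by 25-50% across different model sizes.
The improvements hold across a variety of training setups.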
Abstract
This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.
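The abstract reports three kinds of quantity: word error rate, relative reduction in word error rate, and inference speedup. The following is a minimal sketch of how such figures are conventionally computed, not the authors' evaluation code. The function names and every number in the example are hypothetical and chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + deletions + insertions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that the new model removes."""
    return (baseline_wer - new_wer) / baseline_wer


def inference_speedup(baseline_seconds: float, new_seconds: float) -> float:
    """How many times faster the new model is than the baseline."""
    return baseline_seconds / new_seconds


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(relative_wer_reduction(baseline_wer=10.0, new_wer=8.65))
    print(inference_speedup(baseline_seconds=19.0, new_seconds=10.0))
```

Note that a relative reduction is measured against the baseline's error rate, so a 13.5% relative reduction is not the same as a drop of 13.5 absolute WER points.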
Code & Models
- 🤗 asapp/sew-d-base-100k · model · 4 dl
- 🤗 asapp/sew-d-base-plus-100k · model · 4 dl
- 🤗 asapp/sew-d-base-plus-400k-ft-ls100h · model · 11 dl · ♡ 4
- 🤗 asapp/sew-d-base-plus-400k · model · 2 dl
- 🤗 asapp/sew-d-mid-100k · model · 1 dl
- 🤗 asapp/sew-d-mid-400k-ft-ls100h · model · 51 dl · ♡ 1
- 🤗 asapp/sew-d-mid-400k · model · 4 dl · ♡ 1
- 🤗 asapp/sew-d-mid-k127-100k · model · 3 dl
- 🤗 asapp/sew-d-mid-k127-400k-ft-ls100h · model · 6 dl
- 🤗 asapp/sew-d-mid-k127-400k · model · 2 dl
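One way to try a checkpoint from the list above is through the Hugging Face `transformers` library. The sketch below rests on several assumptions: that `transformers` and `torch` are installed, that the checkpoints load with `SEWDForCTC` and `Wav2Vec2Processor`, that a `-ft-` segment in a name marks a checkpoint fine-tuned for recognition (the others being pre-trained only), and that the input is 16 kHz mono audio. The helper names `finetuned_checkpoints` and `transcribe` are hypothetical.

```python
# Checkpoint names copied from the list above.
CHECKPOINTS = [
    "asapp/sew-d-base-100k",
    "asapp/sew-d-base-plus-100k",
    "asapp/sew-d-base-plus-400k-ft-ls100h",
    "asapp/sew-d-base-plus-400k",
    "asapp/sew-d-mid-100k",
    "asapp/sew-d-mid-400k-ft-ls100h",
    "asapp/sew-d-mid-400k",
    "asapp/sew-d-mid-k127-100k",
    "asapp/sew-d-mid-k127-400k-ft-ls100h",
    "asapp/sew-d-mid-k127-400k",
]


def finetuned_checkpoints(names):
    """Keep names containing '-ft-'.

    Assumption: this naming convention marks checkpoints fine-tuned for
    speech recognition, which are the ones usable for transcription as-is.
    """
    return [name for name in names if "-ft-" in name]


def transcribe(waveform, model_id="asapp/sew-d-base-plus-400k-ft-ls100h",
               sampling_rate=16000):
    """Greedy CTC transcription of a 1-D float waveform (assumed 16 kHz mono)."""
    # Imported lazily so the helpers above work without these packages.
    import torch
    from transformers import SEWDForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = SEWDForCTC.from_pretrained(model_id)
    model.eval()

    inputs = processor(waveform, sampling_rate=sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Pick the most likely token per frame; the processor collapses repeats
    # and blanks when decoding.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Calling `transcribe` downloads the checkpoint on first use, so it needs network access; `finetuned_checkpoints(CHECKPOINTS)` runs offline and simply narrows the list to the three fine-tuned entries.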
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
