Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for End-to-End Speech Recognition

Ye Bai; Jie Li; Wenjing Han; Hao Ni; Kaituo Xu; Zhuo Zhang; Cheng Yi; Xiaorui Wang

arXiv:2209.08326·eess.AS·September 20, 2022

Open Access

TL;DR

This paper introduces a parameter-efficient conformer model for speech recognition that uses sparsely-gated experts and shared parameters, maintaining performance while significantly reducing memory usage.

Contribution

It proposes a conformer architecture that uses sparsely-gated mixture-of-experts (MoE) to extend block capacity without increasing computation, and shares parameters across grouped blocks to reduce the parameter count.

Findings

01. Achieves comparable performance with only one-third of the encoder parameters.

02. Keeps the MoE routers and normalization individual to each block so that the shared blocks stay flexible.

03. Employs knowledge distillation to further boost model accuracy.
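The distillation idea in the last finding can be illustrated with a generic sketch. This summary does not state the exact objective, so the version below is an assumption: a KL divergence between temperature-softened teacher and student output distributions. The names `kd_loss` and `temperature`, and the toy logits, are illustrative only.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax over a list of logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: a generic distillation loss, not necessarily the paper's.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

teacher = [2.0, 0.5, -1.0]                      # hypothetical teacher logits
loss_same = kd_loss(teacher, teacher)           # student matches teacher
loss_diff = kd_loss([0.0, 0.0, 0.0], teacher)   # uninformed student
```

In practice a term like this would usually be combined with the ordinary recognition loss; how the two are weighted is not specified here.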

Abstract

While transformers and their conformer variants show promising performance in speech recognition, their heavily parameterized nature leads to high memory cost during training and inference. Some works use cross-layer weight sharing to reduce the number of model parameters. However, the resulting loss of capacity harms model performance. To address this issue, this paper proposes a parameter-efficient conformer that shares sparsely-gated experts. Specifically, we use sparsely-gated mixture-of-experts (MoE) to extend the capacity of a conformer block without increasing computation. Then, the parameters of the grouped conformer blocks are shared so that the number of parameters is reduced. Next, to give the shared blocks the flexibility to adapt representations at different levels, we design the MoE routers and normalization individually. Moreover, we use knowledge distillation to…
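The sharing scheme described in the abstract can be sketched roughly as follows. This is a toy NumPy illustration under stated assumptions, not the authors' implementation: it models only a feed-forward sub-layer (a real conformer block also has attention and convolution modules), uses top-1 routing, and picks arbitrary sizes. All names (`shared_experts`, `per_block`, `moe_ffn`) are hypothetical. The expert weights are shared by every block in the group, while each block keeps its own router and layer-norm parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, E, BLOCKS = 8, 16, 4, 4  # model dim, expert hidden dim, experts, blocks in one group (all assumed)

# Shared across all blocks in the group: the expert feed-forward weights.
shared_experts = {
    "w_in": rng.standard_normal((E, D, H)) * 0.1,
    "w_out": rng.standard_normal((E, H, D)) * 0.1,
}

# Individual to each block: router weights and layer-norm scale/bias.
per_block = [
    {"router": rng.standard_normal((D, E)) * 0.1,
     "ln_scale": np.ones(D),
     "ln_bias": np.zeros(D)}
    for _ in range(BLOCKS)
]

def layer_norm(x, scale, bias, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * scale + bias

def moe_ffn(x, block, experts):
    """Top-1 routed feed-forward: each frame is processed by a single expert."""
    h = layer_norm(x, block["ln_scale"], block["ln_bias"])
    logits = h @ block["router"]                          # (T, E)
    logits = logits - logits.max(axis=-1, keepdims=True)  # stabilize softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    choice = probs.argmax(axis=-1)                        # expert index per frame
    out = np.zeros_like(x)
    for e in range(E):
        idx = np.where(choice == e)[0]
        if idx.size == 0:
            continue
        hidden = np.maximum(h[idx] @ experts["w_in"][e], 0.0)  # ReLU
        out[idx] = (hidden @ experts["w_out"][e]) * probs[idx, e:e + 1]
    return x + out  # residual connection

def encoder(x):
    # Every block reuses the same experts but applies its own router and norm.
    for block in per_block:
        x = moe_ffn(x, block, shared_experts)
    return x

def count(params):
    return sum(p.size for p in params.values())

shared_count = count(shared_experts)             # expert weights stored once
unshared_count = BLOCKS * count(shared_experts)  # if every block had its own copy
frames = rng.standard_normal((5, D))             # 5 toy input frames
output = encoder(frames)
```

Because each frame activates only one expert, the per-frame computation is close to that of a single feed-forward layer even though `E` experts exist. The parameter counts here cover only the toy expert weights and say nothing about the paper's actual configuration.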

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Knowledge Distillation