Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

Sehoon Kim; Amir Gholami; Albert Shaw; Nicholas Lee; Karttikeya Mangalam; Jitendra Malik; Michael W. Mahoney; Kurt Keutzer

arXiv:2206.00888 · eess.AS · October 18, 2022 · 75 cites

TL;DR

Squeezeformer is a simplified, efficient Transformer architecture for speech recognition. It redesigns the macro- and micro-architecture of Conformer and outperforms Conformer-CTC at the same FLOPs, reaching state-of-the-art WERs.

Contribution

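The paper proposes Squeezeformer, an architecture that re-examines Conformer's macro- and micro-architecture design choices. It adds a Temporal U-Net structure to reduce the cost of multi-head attention on long sequences, and it replaces the Macaron structure with a simpler block in which each multi-head attention or convolution module is followed by a feed-forward module.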

Findings

01

Squeezeformer achieves 6.0-7.5% WER on LibriSpeech test-other without external language models.

02

It improves on Conformer-CTC models with the same number of FLOPs by 0.6-3.1% WER.

03

The model is more efficient due to the Temporal U-Net and depthwise down-sampling layers.
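
The WER figures in the findings above are word error rates: the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. As a minimal sketch of that metric (the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, not the paper's evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is a plain whitespace split, with no
    text normalization such as casing or punctuation handling.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
wer = word_error_rate("a b c d", "a x c d")
```

Under this definition, a WER of 7.5% corresponds to roughly 7.5 word errors per 100 reference words.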

Abstract

The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After re-examining the design choices for both the macro and micro-architecture of Conformer, we propose Squeezeformer which consistently outperforms the state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure of multi-head attention or convolution modules followed up by feed-forward module instead of the Macaron structure proposed in…
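
The Temporal U-Net idea described in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the names `downsample`, `upsample`, `temporal_unet`, and `attention_pairs` are hypothetical, frame dropping stands in for a learned down-sampling layer, frame repetition stands in for up-sampling, and the stages are arbitrary callables rather than attention or convolution blocks.

```python
from typing import Callable, List

Frame = List[float]
Sequence = List[Frame]
Stage = Callable[[Sequence], Sequence]


def downsample(x: Sequence, factor: int = 2) -> Sequence:
    # Assumption: keep every `factor`-th frame in place of a learned layer.
    return x[::factor]


def upsample(x: Sequence, factor: int, length: int) -> Sequence:
    # Assumption: repeat each frame `factor` times, then trim to `length`.
    repeated = [frame for frame in x for _ in range(factor)]
    return repeated[:length]


def add(a: Sequence, b: Sequence) -> Sequence:
    # Element-wise sum of two sequences with the same number of frames.
    return [[u + v for u, v in zip(fa, fb)] for fa, fb in zip(a, b)]


def temporal_unet(x: Sequence, early: Stage, middle: Stage, late: Stage,
                  factor: int = 2) -> Sequence:
    x = early(x)                      # full temporal resolution
    skip = x                          # kept for the skip connection
    x = middle(downsample(x, factor)) # reduced temporal resolution
    x = add(upsample(x, factor, len(skip)), skip)
    return late(x)                    # full temporal resolution again


def attention_pairs(num_frames: int) -> int:
    # Self-attention scores every frame against every frame.
    return num_frames * num_frames


identity: Stage = lambda s: s
frames = [[float(t)] for t in range(7)]
out = temporal_unet(frames, identity, identity, identity)
full_cost = attention_pairs(len(frames))
reduced_cost = attention_pairs(len(downsample(frames)))
```

Because self-attention compares every frame with every other frame, halving the number of frames in the middle stage cuts the number of attention scores by roughly a factor of four (here 49 versus 16), which is the kind of saving on long sequences that the abstract refers to.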

Code & Models

Models

🤗 TanmayNanda/ishara
model · 1 dl · ♡ 1

Videos

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition · SlidesLive

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Softmax · Linear Layer · Concatenated Skip Connection · Max Pooling · U-Net · Convolution · Layer Normalization