Zipformer: A faster and better encoder for automatic speech recognition
Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, Daniel Povey

TL;DR
Zipformer is a faster, more memory-efficient, and better-performing encoder for automatic speech recognition (ASR). It combines architectural changes (a U-Net-like multi-rate structure, BiasNorm, and the SwooshR/SwooshL activations) with a new optimizer, ScaledAdam.
Contribution
The paper introduces Zipformer, a novel encoder architecture with innovative modeling techniques and a new optimizer, achieving superior ASR performance and efficiency.
Findings
Outperforms state-of-the-art ASR models on multiple datasets
Faster convergence with the new ScaledAdam optimizer
More memory-efficient and better-performing encoder architecture
Abstract
The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where middle stacks operate at lower frame rates; 2) reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) a modified form of LayerNorm called BiasNorm allows us to retain some length information; 4) new activation functions SwooshR and SwooshL work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explicitly learns the parameter scale. It achieves faster…
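The abstract names BiasNorm, SwooshR/SwooshL, and ScaledAdam without giving formulas. The following is a minimal pure-Python sketch under stated assumptions: the Swoosh and BiasNorm definitions and their constants are taken from the full paper as commonly quoted, not from the text above, and all function names are hypothetical. `scaled_step` is a deliberate simplification that shows only the idea of scaling an update by the tensor's current RMS; it omits Adam's moment estimates and the learned parameter scale, so it is not the ScaledAdam optimizer itself.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def swoosh_r(x):
    # Assumed definition: log(1 + exp(x - 1)) - 0.08 * x - 0.313261687.
    # The constant offset makes the function pass (almost exactly) through 0 at x = 0.
    return softplus(x - 1.0) - 0.08 * x - 0.313261687


def swoosh_l(x):
    # Assumed definition: log(1 + exp(x - 4)) - 0.08 * x - 0.035.
    return softplus(x - 4.0) - 0.08 * x - 0.035


def rms(values):
    # Root mean square of a list of floats.
    return math.sqrt(sum(v * v for v in values) / len(values))


def bias_norm(x, bias, log_scale, eps=1e-12):
    # Assumed definition: x / RMS(x - bias) * exp(log_scale).
    # Unlike LayerNorm, the mean is not subtracted from the output, and a
    # non-zero bias lets the output norm vary with the input.
    # eps is an illustrative guard against division by zero.
    centered_sq = sum((xi - bi) ** 2 for xi, bi in zip(x, bias)) / len(x)
    denom = math.sqrt(centered_sq + eps)
    scale = math.exp(log_scale)
    return [xi / denom * scale for xi in x]


def scaled_step(param, direction, lr, eps=1e-8):
    # Simplified illustration of "scale the update by the tensor's current scale":
    # `direction` stands in for an already-normalized Adam-style update.
    r = max(rms(param), eps)
    return [p - lr * r * d for p, d in zip(param, direction)]


if __name__ == "__main__":
    print(swoosh_r(0.0), swoosh_l(0.0))
    print(bias_norm([3.0, 4.0], [0.0, 0.0], 0.0))
    print(scaled_step([10.0, -10.0], [1.0, -1.0], lr=0.1))
    print(scaled_step([0.1, -0.1], [1.0, -1.0], lr=0.1))
```

With a zero bias and zero log-scale, `bias_norm` returns a vector of unit RMS; a non-zero bias changes the output norm, which is one way to read "retain some length information". In `scaled_step`, a large tensor and a small tensor given the same direction and learning rate change by about the same fraction of their size, matching the abstract's "keep the relative change about the same".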
Peer Reviews
Decision · ICLR 2024 oral
1. A newly designed Conformer variant that achieves SOTA performance on the speech recognition task. 2. Experiments are done on different datasets with training data at different scales (hundreds, thousands and tens of thousands). 3. Experimental results are strong, indicating the effectiveness of the model.
1. While the motivations for BiasNorm and ScaledAdam are well explained, the motivation for the Zipformer block, especially the downsampling and upsampling modules, is not well presented. 2. The results on AISHELL-1 are not quite convincing compared to other Conformer variants. The authors could elaborate more on the performance. Could it be that this is a small dataset?
- There are a lot of interesting novelties in the paper, like the Zipformer model itself, ScaledAdam, BiasNorm, new activation functions, and more. (Although having so many different novelties is also a weakness, see below.) - Improvements are very good, i.e. good relative WER improvements, while also being more efficient. - Good ASR baselines.
- There are maybe too many new things being introduced here, which are all interesting by themselves, but each of them would maybe require more investigation and analysis on its own. E.g. introducing a new optimizer (ScaledAdam) is interesting, but it should be tested on a couple of different models and benchmarks, and this would basically be a work on its own. Now we have way too little analysis for each of the introduced methods to really tell how good they are. The ablation study is basically
This work presents an alternative approach to the standard (or widely used) transformer encoder structure, which is very interesting to read. The authors explain the motivations behind some of the proposed modifications, together with the ablation studies.
From my biased view, the major weakness of this paper is that it actually presents two inter-connected but (arguably) separate works: one is the novel encoder structure, including the U-Net with middle stacks operating at lower frame rates, sharing attention weights between two self-attention modules, a novel non-linear attention, BiasNorm, and slightly modified SwooshL/SwooshR activation functions; the other is about the modified Adam optimizer, ScaledAdam, which the auth
Code & Models
- 🤗 reazon-research/reazonspeech-k2-v2 · model · ♡ 24
- 🤗 csukuangfj/reazonspeech-k2-v2 · model
- 🤗 marcoyang/spear-xlarge-speech-audio · model · 53k dl · ♡ 4
- 🤗 marcoyang/spear-large-speech · model · 14 dl
- 🤗 marcoyang/spear-large-speech-audio · model · 141 dl
- 🤗 marcoyang/spear-base-speech · model · 38 dl
- 🤗 marcoyang/spear-base-speech-audio · model · 6 dl · ♡ 2
- 🤗 marcoyang/spear-base-speech-audio-v2 · model · 92 dl
- 🤗 marcoyang/spear-base-speech-v2 · model · 133 dl
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
Methods: Sigmoid Activation · Convolution · Adam
