Zipformer: A faster and better encoder for automatic speech recognition

Zengwei Yao; Liyong Guo; Xiaoyu Yang; Wei Kang; Fangjun Kuang; Yifan Yang; Zengrui Jin; Long Lin; Daniel Povey

arXiv:2310.11230 · eess.AS · April 11, 2024 · 28 cites

TL;DR

Zipformer is an improved, faster, and more memory-efficient encoder for automatic speech recognition that outperforms existing models through novel architecture modifications and a new optimizer.

Contribution

The paper introduces Zipformer, a novel encoder architecture with innovative modeling techniques and a new optimizer, achieving superior ASR performance and efficiency.

Findings

01. Outperforms state-of-the-art ASR models on multiple datasets
02. Faster convergence with the new ScaledAdam optimizer
03. More memory-efficient and better-performing encoder architecture
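The abstract describes ScaledAdam as scaling each update by the tensor's current scale so that the relative change stays about the same. The following is a minimal sketch of that one idea only, not the authors' implementation: it is a plain Adam step multiplied by the parameter tensor's RMS. The function names, the hyperparameter values, and the `min_rms` floor are illustrative assumptions, and the explicit learning of the parameter scale mentioned in the abstract is omitted.

```python
import math


def rms(values):
    """Root-mean-square of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def scaled_adam_step(param, grad, m, v, step,
                     lr=0.01, beta1=0.9, beta2=0.98,
                     eps=1e-8, min_rms=1e-5):
    """One update of a single parameter tensor, stored as a flat list.

    Sketch only: a standard Adam step whose size is multiplied by the
    tensor's current RMS. All defaults are assumed, illustrative values.
    Returns the new (param, m, v).
    """
    # Standard Adam moment estimates.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    bias_corr1 = 1 - beta1 ** step
    bias_corr2 = 1 - beta2 ** step

    # Scale the update by the tensor's current RMS (floored so that
    # an all-zero tensor can still move).
    scale = max(rms(param), min_rms)

    new_param = [
        p - lr * scale * (mi / bias_corr1) / (math.sqrt(vi / bias_corr2) + eps)
        for p, mi, vi in zip(param, m, v)
    ]
    return new_param, m, v


def relative_change(old, new):
    """RMS of the update divided by RMS of the old parameter."""
    delta = [n - o for o, n in zip(old, new)]
    return rms(delta) / rms(old)
```

With the same gradient, a tensor and a copy of it scaled by 100 receive updates of the same relative size, which is the property the abstract describes:

```python
grad = [0.1, 0.2, -0.3]
small = [1.0, -2.0, 3.0]
big = [100.0 * p for p in small]
zeros = [0.0, 0.0, 0.0]

new_small, _, _ = scaled_adam_step(small, grad, zeros, zeros, step=1)
new_big, _, _ = scaled_adam_step(big, grad, zeros, zeros, step=1)
print(relative_change(small, new_small), relative_change(big, new_big))
```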

Abstract

The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where middle stacks operate at lower frame rates; 2) reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) a modified form of LayerNorm called BiasNorm allows us to retain some length information; 4) new activation functions SwooshR and SwooshL work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explicitly learns the parameter scale. It achieves faster…
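The abstract names BiasNorm, SwooshR, and SwooshL without giving their definitions. The sketch below uses the forms as I recall them from the paper; this page does not state them, so the constants and the exact BiasNorm expression are assumptions to check against the paper or the k2-fsa/icefall code. The `eps` guard in `bias_norm` is my own addition for numerical safety, and all function names are illustrative.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def swoosh_r(x: float) -> float:
    # Assumed form: log(1 + exp(x - 1)) - 0.08 * x - 0.313261687
    # The constant offset makes swoosh_r(0) approximately zero.
    return softplus(x - 1.0) - 0.08 * x - 0.313261687


def swoosh_l(x: float) -> float:
    # Assumed form: log(1 + exp(x - 4)) - 0.08 * x - 0.035
    # Slightly negative at x = 0.
    return softplus(x - 4.0) - 0.08 * x - 0.035


def bias_norm(x, bias, log_scale=0.0, eps=1e-8):
    """Assumed form: x / RMS(x - bias) * exp(log_scale).

    x and bias are flat lists over the channel dimension. In a real
    model, bias and log_scale would be learned parameters.
    """
    mean_sq = sum((xi - bi) ** 2 for xi, bi in zip(x, bias)) / len(x)
    gain = math.exp(log_scale) / (math.sqrt(mean_sq) + eps)
    return [xi * gain for xi in x]


def norm(values):
    """Euclidean length of a flat list."""
    return math.sqrt(sum(v * v for v in values))
```

Under these assumptions, a zero bias discards the input's length entirely, while a nonzero bias lets some of it through, which is one way to read "retain some length information":

```python
zero_bias = [0.0, 0.0]
some_bias = [0.5, 0.5]
print(norm(bias_norm([1.0, 2.0], zero_bias)), norm(bias_norm([2.0, 4.0], zero_bias)))
print(norm(bias_norm([1.0, 2.0], some_bias)), norm(bias_norm([2.0, 4.0], some_bias)))
```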

Peer Reviews

Decision: ICLR 2024 oral

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 4

Strengths

1. A newly designed Conformer variant that achieves SOTA performance on the speech recognition task.
2. Experiments are run on different datasets with training data at different scales (hundreds, thousands, and tens of thousands).
3. Experimental results are strong, indicating the effectiveness of the model.

Weaknesses

1. While the motivation for BiasNorm and ScaledAdam is well explained, the motivation for the Zipformer block, especially the downsampling and upsampling modules, is not well presented.
2. The results on AISHELL-1 are not very convincing compared to other Conformer variants. The authors could elaborate more on the performance. Could it be that this is a small dataset?

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 5

Strengths

- There are a lot of interesting novelties in the paper, such as the Zipformer model itself, ScaledAdam, BiasNorm, new activation functions, and more. (Although having so many different novelties is also a weakness, see below.)
- Improvements are very good, i.e. good relative WER improvements, while also being more efficient.
- Good ASR baselines.

Weaknesses

- There are maybe too many new things being introduced here, which are all interesting by themselves, but each of them would maybe require more investigation and analysis on its own. E.g. introducing a new optimizer (ScaledAdam) is interesting, but it should be tested on a couple of different models and benchmarks, and this would basically be a work on its own. Now we have way too little analysis for each of the introduced methods to really tell how good they are. The ablation study is basically

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

This work presents an alternative approach to the standard (or widely used) transformer encoder structure, which is very interesting to read. The authors explain the motivations behind some of the proposed modifications, together with the ablation studies.

Weaknesses

From my biased view, the major weakness of this paper is that it actually presents two inter-connected but (arguably) separate works: one is the novel encoder structure, including the U-Net with middle stacks operating at lower frame rates, sharing attention weights across two self-attention modules, a novel non-linear attention, BiasNorm, and the slightly modified SwooshL/SwooshR activation functions; the other is about the modified Adam optimizer, ScaledAdam, which the auth

Code & Models

Repositories

k2-fsa/icefall (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques

Methods: Sigmoid Activation · Convolution · Adam