FusionFormer: Fusing Operations in Transformer for Efficient Streaming Speech Recognition

Xingchen Song; Di Wu; Binbin Zhang; Zhiyong Wu; Wenpeng Li; Dongfang Li; Pengshen Zhang; Zhendong Peng; Fuping Pan; Changbao Zhu; Zhongqin Wu

arXiv:2210.17079·cs.SD·November 1, 2022

Open Access

TL;DR

FusionFormer introduces a normalization and activation strategy in Conformer models that replaces Layer Normalization with Batch Normalization and simpler activations, enabling operator fusion for faster inference without sacrificing accuracy.

Contribution

The paper proposes a normalization and activation scheme for Conformer models that cuts inference time by about 10% through operator fusion while maintaining recognition accuracy.

Findings

01

FusionFormer achieves comparable accuracy to LN-based Conformer.

02

Inference speed is improved by approximately 10%.

03

Operator fusion folds the replacement BN layers into adjacent layers, so they add no extra cost at inference.

Abstract

The recently proposed Conformer architecture, which combines convolution with attention to capture both local and global dependencies, has become the de facto backbone model for Automatic Speech Recognition (ASR). Inherited from Natural Language Processing (NLP) tasks, the architecture takes Layer Normalization (LN) as its default normalization technique. However, through a series of systematic studies, we find that LN might take 10% of the inference time even though it contributes only 0.1% of the FLOPs. This motivates us to replace LN with other normalization techniques, e.g., Batch Normalization (BN), to speed up inference with the help of operator fusion methods and the avoidance of calculating the mean and variance statistics during inference. After examining several plain attempts which directly remove all LN layers or replace them with BN in the same place, we find…
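The operator fusion the abstract refers to can be illustrated with a minimal sketch. It assumes a BN layer placed directly after a linear layer and running in inference mode, where the mean and variance are fixed running statistics rather than values computed per input. Under that assumption BN is an affine map and can be folded into the linear layer's weights and bias. The function names (`linear`, `batch_norm_inference`, `fold_bn_into_linear`) and the toy dimensions are hypothetical and are not taken from the paper's code.

```python
import math
import random


def linear(x, W, b):
    """y = W x + b for a single input vector x (W has one row per output)."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def batch_norm_inference(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BN: uses fixed running statistics, not per-input ones."""
    return [
        g * (v - m) / math.sqrt(s + eps) + bt
        for v, g, bt, m, s in zip(y, gamma, beta, mean, var)
    ]


def fold_bn_into_linear(W, b, gamma, beta, mean, var, eps=1e-5):
    """Return (W_fused, b_fused) such that W_fused x + b_fused == BN(W x + b)."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    W_fused = [[s * w for w in row] for row, s in zip(W, scale)]
    b_fused = [(b_i - m) * s + bt for b_i, m, s, bt in zip(b, mean, scale, beta)]
    return W_fused, b_fused


# Toy parameters (illustrative values only).
rng = random.Random(0)
d_in, d_out = 8, 4
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
b = [rng.gauss(0, 1) for _ in range(d_out)]
gamma = [rng.gauss(0, 1) for _ in range(d_out)]
beta = [rng.gauss(0, 1) for _ in range(d_out)]
running_mean = [rng.gauss(0, 1) for _ in range(d_out)]
running_var = [rng.uniform(0.5, 2.0) for _ in range(d_out)]
x = [rng.gauss(0, 1) for _ in range(d_in)]

# Two operators (linear, then BN) versus one fused linear operator.
y_unfused = batch_norm_inference(linear(x, W, b), gamma, beta, running_mean, running_var)
W_fused, b_fused = fold_bn_into_linear(W, b, gamma, beta, running_mean, running_var)
y_fused = linear(x, W_fused, b_fused)

max_abs_diff = max(abs(u - f) for u, f in zip(y_unfused, y_fused))
```

The same folding does not apply to LN, because LN computes its mean and variance from each input at inference time, so its scale and shift are not constants that can be absorbed into a neighboring layer's weights.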

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Convolution · Sigmoid Activation