FusionFormer: Fusing Operations in Transformer for Efficient Streaming Speech Recognition
Xingchen Song, Di Wu, Binbin Zhang, Zhiyong Wu, Wenpeng Li, Dongfang Li, Pengshen Zhang, Zhendong Peng, Fuping Pan, Changbao Zhu, Zhongqin Wu

TL;DR
FusionFormer introduces a normalization and activation strategy in Conformer models that replaces Layer Normalization with Batch Normalization and simpler activations, enabling operator fusion for faster inference without sacrificing accuracy.
Contribution
The paper proposes a normalization and activation scheme for Conformer models that, through operator fusion, reduces inference time by approximately 10% while maintaining recognition accuracy.
Findings
FusionFormer achieves comparable accuracy to LN-based Conformer.
Inference speed is improved by approximately 10%.
Operator fusion removes the extra inference cost of the normalization layers.
Abstract
The recently proposed Conformer architecture, which combines convolution with attention to capture both local and global dependencies, has become the de facto backbone model for Automatic Speech Recognition (ASR). Inherited from Natural Language Processing (NLP) tasks, the architecture uses Layer Normalization (LN) as its default normalization technique. However, through a series of systematic studies, we find that LN might take 10% of the inference time even though it contributes only 0.1% of the FLOPs. This motivates us to replace LN with other normalization techniques, e.g., Batch Normalization (BN), to speed up inference through operator fusion and by avoiding the computation of mean and variance statistics during inference. After examining several plain attempts that directly remove all LN layers or replace them with BN in the same place, we find…
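The fusion idea mentioned in the abstract can be illustrated with a small sketch. At inference, BN uses fixed running statistics, so it is a per-channel affine transform that can be folded into a preceding linear layer; LN, by contrast, must compute a mean and variance for every input. The example below is the standard BN-folding identity written in plain Python, not the authors' implementation. The function names, the toy dimensions, and the placement of BN directly after a linear layer are assumptions made for illustration only.

```python
import math
import random


def linear(W, b, x):
    """Compute y = W x + b for a weight matrix W (list of rows) and vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def batch_norm_inference(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BN: uses stored running statistics, not statistics of y."""
    return [
        g * (yi - m) / math.sqrt(v + eps) + bt
        for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)
    ]


def fold_bn_into_linear(W, b, gamma, beta, mean, var, eps=1e-5):
    """Return (W_fused, b_fused) such that W_fused x + b_fused == BN(W x + b)."""
    W_fused, b_fused = [], []
    for row, bi, g, bt, m, v in zip(W, b, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)          # per-output-channel scale
        W_fused.append([scale * w for w in row])
        b_fused.append(scale * (bi - m) + bt)
    return W_fused, b_fused


# Toy parameters (hypothetical sizes, random values).
random.seed(0)
d_in, d_out = 8, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
b = [random.gauss(0, 1) for _ in range(d_out)]
gamma = [random.gauss(0, 1) for _ in range(d_out)]
beta = [random.gauss(0, 1) for _ in range(d_out)]
mean = [random.gauss(0, 1) for _ in range(d_out)]
var = [random.uniform(0.5, 2.0) for _ in range(d_out)]
x = [random.gauss(0, 1) for _ in range(d_in)]

# Unfused path: two operators (linear, then BN).
y_ref = batch_norm_inference(linear(W, b, x), gamma, beta, mean, var)

# Fused path: a single linear operator with rewritten parameters.
W_f, b_f = fold_bn_into_linear(W, b, gamma, beta, mean, var)
y_fused = linear(W_f, b_f, x)

max_abs_diff = max(abs(a - c) for a, c in zip(y_ref, y_fused))
```

Because the fused weights are computed once offline, the normalization step disappears from the inference graph entirely. The same rewrite is not available for LN, whose statistics depend on each input.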
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Convolution · Sigmoid Activation
