EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization

Jianzong Wang; Ziqi Liang; Xulong Zhang; Ning Cheng; Jing Xiao

arXiv:2404.19214·cs.SD·May 1, 2024

TL;DR

EfficientASR introduces a lightweight Transformer-based speech recognition model that reduces computational redundancy and parameters, achieving comparable or better accuracy on public datasets.

Contribution

The paper presents EfficientASR, a novel model with shared attention and chunk-level feedforward modules, improving efficiency over traditional Transformer speech recognizers.

Findings

01. 36% reduction in model parameters compared to the baseline Transformer network

02. 0.3% CER improvement on Aishell-1

03. 0.2% CER improvement on HKUST

Abstract

In recent years, Transformer networks have shown remarkable performance in speech recognition tasks. However, their deployment poses challenges due to high computational and storage resource requirements. To address this issue, a lightweight model called EfficientASR is proposed in this paper, aiming to enhance the versatility of Transformer models. EfficientASR employs two primary modules: Shared Residual Multi-Head Attention (SRMHA) and Chunk-Level Feedforward Networks (CFFN). The SRMHA module effectively reduces redundant computations in the network, while the CFFN module captures spatial knowledge and reduces the number of parameters. The effectiveness of the EfficientASR model is validated on two public datasets, namely Aishell-1 and HKUST. Experimental results demonstrate a 36% reduction in parameters compared to the baseline Transformer network, along with improvements of 0.3%…
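
The abstract states that CFFN reduces the number of parameters but does not give the layer's exact form. As a rough illustration only, the sketch below assumes a chunk-level feedforward layer that splits the feature dimension into equal chunks and gives each chunk its own smaller two-layer FFN; that design, the function names, and the sizes `d_model=256` and `d_ff=2048` are assumptions made for this example and are not taken from the paper. It only counts parameters and is not meant to reproduce the reported 36% figure.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights and biases of a two-layer position-wise FFN (d_model -> d_ff -> d_model)."""
    first = d_model * d_ff + d_ff        # first linear layer: weights + biases
    second = d_ff * d_model + d_model    # second linear layer: weights + biases
    return first + second


def chunked_ffn_params(d_model: int, d_ff: int, n_chunks: int) -> int:
    """Parameter count when each feature chunk has its own smaller FFN.

    ASSUMPTION: chunks split the feature dimension evenly and do not share
    weights. This is a hypothetical reading of "chunk-level", not the
    paper's confirmed design.
    """
    if n_chunks < 1 or d_model % n_chunks or d_ff % n_chunks:
        raise ValueError("d_model and d_ff must be divisible by n_chunks")
    return n_chunks * ffn_params(d_model // n_chunks, d_ff // n_chunks)


if __name__ == "__main__":
    d_model, d_ff = 256, 2048  # hypothetical sizes, chosen for illustration
    full = ffn_params(d_model, d_ff)
    for k in (1, 2, 4, 8):
        chunked = chunked_ffn_params(d_model, d_ff, k)
        print(f"chunks={k}: {chunked} params ({chunked / full:.1%} of the full FFN)")
```

Under this assumption the weight matrices shrink by roughly a factor equal to the number of chunks, while the bias terms stay the same size, which is one plausible way a chunked feedforward layer could cut parameters.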

Taxonomy

Topics: Speech and Audio Processing