Simplified Self-Attention for Transformer-based End-to-End Speech Recognition
Haoneng Luo, Shiliang Zhang, Ming Lei, Lei Xie

TL;DR
This paper introduces a simplified self-attention (SSAN) layer for transformer-based speech recognition that cuts model parameters by over 20% while maintaining or improving recognition accuracy on Mandarin speech tasks.
Contribution
The paper proposes a simplified self-attention (SSAN) layer that forms the query and key vectors with an FSMN memory block instead of projection layers, reducing model parameters without sacrificing performance.
Findings
Over 20% reduction in model parameters on AISHELL-1.
6.7% relative character error rate (CER) reduction with SSAN on AISHELL-1.
No performance loss on the 20,000-hour large-scale task.
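The findings above are stated as relative reductions, so a short sketch of that arithmetic may help. The function name and the numbers in the usage note are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is lower than `baseline`.

    Works for any quantity where lower is better, such as CER or
    parameter count. Multiply by 100 to read it as a percentage.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline
```

For example, with made-up values, `relative_reduction(10.0, 9.0)` is about 0.10, i.e. a 10% relative reduction even though the absolute difference is only 1.0.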
Abstract
Transformer models have been introduced into end-to-end speech recognition with state-of-the-art performance on various tasks owing to their superiority in modeling long-term dependencies. However, such improvements are usually obtained through the use of very large neural networks. Transformer models mainly include two submodules: position-wise feedforward layers and self-attention (SAN) layers. In this paper, to reduce the model complexity while maintaining good performance, we propose a simplified self-attention (SSAN) layer which employs an FSMN memory block instead of projection layers to form query and key vectors for transformer-based end-to-end speech recognition. We evaluate the SSAN-based and the conventional SAN-based transformers on the public AISHELL-1 task and on internal 1000-hour and 20,000-hour large-scale Mandarin tasks. Results show that our proposed SSAN-based transformer model…
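The idea in the abstract can be sketched roughly as follows. This is a minimal single-head illustration in plain Python, not the authors' implementation. It makes several assumptions: the memory block is a per-channel FIR filter over time with a skip connection (a common FSMN-style formulation), the query and the key share one memory-block output, and the values still come from a linear projection. All function and parameter names are hypothetical.

```python
import math

def fsmn_memory_block(x, left_taps, right_taps):
    """Per-channel FIR filter over time plus a skip connection (assumed form).

    x: list of T frames, each a list of d floats.
    left_taps[i][k]: weight on frame t-i (i = 0, 1, ...) for channel k.
    right_taps[j][k]: weight on frame t+1+j (j = 0, 1, ...) for channel k.
    Frames outside the sequence are treated as zero.
    """
    T, d = len(x), len(x[0])
    out = []
    for t in range(T):
        frame = list(x[t])  # skip connection
        for i, w in enumerate(left_taps):
            if t - i >= 0:
                for k in range(d):
                    frame[k] += w[k] * x[t - i][k]
        for j, w in enumerate(right_taps):
            if t + 1 + j < T:
                for k in range(d):
                    frame[k] += w[k] * x[t + 1 + j][k]
        out.append(frame)
    return out

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def simplified_self_attention(x, left_taps, right_taps, w_value):
    """Scaled dot-product attention with query = key = memory-block output.

    Assumption: values come from a linear projection w_value (d x d_out).
    """
    d, d_out = len(x[0]), len(w_value[0])
    qk = fsmn_memory_block(x, left_taps, right_taps)
    values = [[sum(f[k] * w_value[k][c] for k in range(d)) for c in range(d_out)]
              for f in x]
    out = []
    for q in qk:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in qk]
        weights = softmax(scores)
        out.append([sum(wt * v[c] for wt, v in zip(weights, values))
                    for c in range(d_out)])
    return out

def qk_parameter_counts(d_model, num_taps):
    """Weights used to form query and key: two d x d projections vs. one
    memory block with num_taps per-channel taps (biases ignored)."""
    return 2 * d_model * d_model, num_taps * d_model
```

Under these assumptions, the saving comes from replacing two dense d x d projection matrices with a handful of per-channel filter taps. With hypothetical sizes, `qk_parameter_counts(256, 11)` compares 131072 projection weights against 2816 memory-block weights per layer; the overall model-level reduction depends on the rest of the network.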
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding
