Simplified Self-Attention for Transformer-based End-to-End Speech Recognition

Haoneng Luo; Shiliang Zhang; Ming Lei; Lei Xie

arXiv:2005.10463 · cs.SD · November 18, 2020 · 1 citation

TL;DR

This paper introduces a simplified self-attention layer for transformer models in speech recognition, reducing model complexity by over 20% while maintaining or improving recognition accuracy across multiple Mandarin speech datasets.

Contribution

The paper proposes a novel simplified self-attention (SSAN) layer using FSMN memory blocks, decreasing model parameters without sacrificing performance.

Findings

01

Over 20% reduction in model parameters on AISHELL-1.

02

6.7% relative CER reduction with SSAN on AISHELL-1.

03

No performance loss on large-scale 20,000-hour task.
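For reference, a "relative CER reduction" is conventionally computed against the baseline error rate rather than as a difference in absolute percentage points. The symbols below are illustrative names, not notation taken from the paper: CER_SAN stands for the character error rate of the conventional SAN-based baseline and CER_SSAN for that of the SSAN-based model.

```latex
\[
\text{relative CER reduction}
  = \frac{\mathrm{CER}_{\mathrm{SAN}} - \mathrm{CER}_{\mathrm{SSAN}}}{\mathrm{CER}_{\mathrm{SAN}}} \times 100\%
\]
```

Under this definition, the 6.7% figure describes the share of the baseline's errors that are removed, not a 6.7-point drop in CER.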

Abstract

Transformer models have been introduced into end-to-end speech recognition with state-of-the-art performance on various tasks owing to their superiority in modeling long-term dependencies. However, such improvements are usually obtained through the use of very large neural networks. Transformer models mainly include two submodules - position-wise feedforward layers and self-attention (SAN) layers. In this paper, to reduce the model complexity while maintaining good performance, we propose a simplified self-attention (SSAN) layer which employs FSMN memory block instead of projection layers to form query and key vectors for transformer-based end-to-end speech recognition. We evaluate the SSAN-based and the conventional SAN-based transformers on the public AISHELL-1, internal 1000-hour and 20,000-hour large-scale Mandarin tasks. Results show that our proposed SSAN-based transformer model…
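The idea in the abstract, forming query and key vectors with an FSMN memory block rather than with learned projection matrices, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a single attention head, a memory block built from per-dimension tap weights over neighbouring frames plus a skip connection, the same memory output reused as both query and key, and a value projection `w_v` that is kept. All function and parameter names (`fsmn_memory`, `ssan_head_sketch`, `left_taps`, `right_taps`, `qk_param_counts`) are hypothetical.

```python
import numpy as np


def fsmn_memory(x, left_taps, right_taps):
    """FSMN-style memory block (assumed form): a per-dimension FIR filter over time.

    x:          (T, d) input frames
    left_taps:  (L, d) weights for the L previous frames
    right_taps: (R, d) weights for the R following frames
    """
    out = x.astype(float)  # astype copies, so this acts as the skip connection
    for i, a in enumerate(left_taps, start=1):
        out[i:] += a * x[:-i]   # frame t receives a * x[t - i]
    for j, c in enumerate(right_taps, start=1):
        out[:-j] += c * x[j:]   # frame t receives c * x[t + j]
    return out


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def ssan_head_sketch(x, left_taps, right_taps, w_v):
    """Single-head attention whose query and key come from the memory block."""
    m = fsmn_memory(x, left_taps, right_taps)
    q = k = m                  # assumption: query and key share the memory output
    v = x @ w_v                # assumption: the value projection is kept
    scores = q @ k.T / np.sqrt(x.shape[1])
    return softmax(scores) @ v


def qk_param_counts(d, n_left, n_right):
    """Weights needed to form Q and K: two d x d projections vs. the memory taps."""
    return {"projections": 2 * d * d, "memory_block": (n_left + n_right) * d}
```

For example, `qk_param_counts(256, 10, 10)` compares the weights needed to form queries and keys under each scheme for illustrative sizes chosen here, not taken from the paper. It counts only that one component, so it does not reproduce the whole-model parameter reduction reported above.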

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding