Transformers with convolutional context for ASR
Abdelrahman Mohamed, Dmytro Okhonko, Luke Zettlemoyer

TL;DR
This paper replaces the sinusoidal positional embedding in speech recognition transformers with convolutionally learned input representations, improving long-range relationship modeling and achieving competitive WER on LibriSpeech with a fixed learning rate and no warmup steps.
Contribution
It replaces sinusoidal positional embeddings with convolutional representations, enhancing transformer performance and optimization stability in ASR tasks.
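To make the contrast concrete, here is a minimal pure-Python sketch, not the authors' implementation. It compares a standard sinusoidal encoding, which depends on the absolute frame index, with a small 1-D convolution over neighbouring frames, whose output depends only on local context and so shifts along with the input. The function names, the kernel values, the kernel width of 3, and the ReLU are illustrative assumptions; the paper's actual layer configuration is not given in this summary.

```python
import math


def sinusoidal_position(t, dim):
    """Standard sinusoidal encoding of absolute position t (the baseline being replaced)."""
    return [
        math.sin(t / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(t / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def conv1d_context(frames, kernels):
    """Hypothetical convolutional front-end: 1-D convolution over time with
    zero 'same' padding and ReLU. `frames` is a list of T feature vectors;
    each kernel is a list of K rows (K odd), one row per time offset."""
    num_frames, feat_dim = len(frames), len(frames[0])
    width = len(kernels[0])
    half = width // 2
    zero = [0.0] * feat_dim
    padded = [zero] * half + frames + [zero] * half
    out = []
    for t in range(num_frames):
        window = padded[t:t + width]
        row = []
        for kernel in kernels:
            acc = sum(
                w * x
                for k_row, x_row in zip(kernel, window)
                for w, x in zip(k_row, x_row)
            )
            row.append(max(0.0, acc))  # ReLU non-linearity (an assumption)
        out.append(row)
    return out


def make_frames(start, length=10):
    """Toy 1-dimensional 'speech features': a two-frame pattern placed at `start`."""
    frames = [[0.0] for _ in range(length)]
    frames[start] = [1.0]
    frames[start + 1] = [2.0]
    return frames


# One illustrative kernel of width 3 over a single feature dimension.
kernels = [[[1.0], [0.5], [-1.0]]]

out_a = conv1d_context(make_frames(2), kernels)
out_b = conv1d_context(make_frames(5), kernels)

# The convolutional response to the pattern is the same wherever it occurs,
# whereas the sinusoidal vector differs between positions 2 and 5.
same_response = out_a[1:6] == out_b[4:9]
different_positions = sinusoidal_position(2, 4) != sinusoidal_position(5, 4)
print(same_response, different_positions)
```

Because the convolution sees only a window of neighbouring frames, whatever ordering information it passes on to the transformer blocks is relative to that window, never an absolute index.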
Findings
Achieved 4.7% WER on LibriSpeech test-clean
Achieved 12.9% WER on LibriSpeech test-other
Trained with a fixed learning rate of 1.0 and no warmup steps
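For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the recognizer output (substitutions, deletions, and insertions), divided by the number of reference words. The sketch below is a generic illustration of that definition. The function name and example sentences are made up for this note, and this is not the scoring tool behind the figures reported above.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: one substitution plus one deletion over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * wer:.1f}% WER")
```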
Abstract
The recent success of transformer networks for neural machine translation and other NLP tasks has led to a surge in research work trying to apply it for speech recognition. Recent efforts studied key research questions around ways of combining positional embedding with speech features, and stability of optimization for large scale learning of transformer networks. In this paper, we propose replacing the sinusoidal positional embedding for transformers with convolutionally learned input representations. These contextual representations provide subsequent transformer blocks with relative positional information needed for discovering long-range relationships between local concepts. The proposed system has favorable optimization characteristics where our reported results are produced with fixed learning rate of 1.0 and no warmup steps. The proposed model achieves a competitive 4.7% and…
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
