LMEC: Learnable Multiplicative Absolute Position Embedding Based Conformer for Speech Recognition
Yuguang Yang, Yu Pan, Jingjing Yin, Heng Lu

TL;DR
This paper introduces LMEC, a Conformer variant with a learnable multiplicative absolute position embedding and kernelized linear attention, achieving faster inference and improved accuracy in speech recognition.
Contribution
It proposes a novel LM-APE re-weighting mechanism and replaces FFN with GLU, enhancing performance and efficiency in long sequence speech recognition.
Findings
Achieves up to 0.63% WER reduction on LibriSpeech test-other.
Speeds up inference by up to 33%.
Outperforms previous linear attention-based Conformers.
Abstract
This paper proposes a Learnable Multiplicative absolute position Embedding based Conformer (LMEC). It contains a kernelized linear attention (LA) module called LMLA to solve the time-consuming problem for long sequence speech recognition as well as an alternative to the FFN structure. First, the ELU function is adopted as the kernel function of our proposed LA module. Second, we propose a novel Learnable Multiplicative Absolute Position Embedding (LM-APE) based re-weighting mechanism that can reduce the well-known quadratic temporal-space complexity of softmax self-attention. Third, we use Gated Linear Units (GLU) to substitute the Feed Forward Network (FFN) for better performance. Extensive experiments have been conducted on the public LibriSpeech datasets. Compared to the Conformer model with cosFormer style linear attention, our proposed method can achieve up to 0.63% word-error-rate…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
MethodsExponential Linear Unit · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax
