LMEC: Learnable Multiplicative Absolute Position Embedding Based Conformer for Speech Recognition

Yuguang Yang; Yu Pan; Jingjing Yin; Heng Lu

arXiv:2212.02099·eess.AS·December 6, 2022

TL;DR

This paper introduces LMEC, a Conformer variant with a learnable multiplicative absolute position embedding and kernelized linear attention, achieving faster inference and improved accuracy in speech recognition.

Contribution

The paper proposes a novel LM-APE re-weighting mechanism for linear attention and replaces the FFN with GLU, improving both accuracy and efficiency in long-sequence speech recognition.
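As a rough illustration of the GLU substitution mentioned above, the following is a minimal sketch of a standard gated linear unit in plain Python. It assumes the common formulation (a linear "value" branch multiplied elementwise by a sigmoid-gated linear branch); the exact GLU layout used in LMEC is not given here, and the names `linear`, `glu`, `w_value`, and `w_gate` are hypothetical.

```python
import math


def sigmoid(z):
    # Logistic function used for the gate.
    return 1.0 / (1.0 + math.exp(-z))


def linear(x, weight, bias):
    # weight has shape (out_dim, in_dim); x is a vector of length in_dim.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def glu(x, w_value, b_value, w_gate, b_gate):
    # Assumed standard GLU: (x W + b) * sigmoid(x V + c), elementwise.
    value = linear(x, w_value, b_value)
    gate = [sigmoid(g) for g in linear(x, w_gate, b_gate)]
    return [v * g for v, g in zip(value, gate)]
```

With an all-zero gate branch the sigmoid evaluates to 0.5, so the output is simply half of the value branch, which makes a convenient sanity check.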

Findings

01. Achieves up to 0.63% WER reduction on LibriSpeech test-other, compared to a Conformer with cosFormer-style linear attention.

02. Speeds up inference by up to 33%.

03. Outperforms previous linear attention-based Conformers.

Abstract

This paper proposes a Learnable Multiplicative absolute position Embedding based Conformer (LMEC). It contains a kernelized linear attention (LA) module called LMLA, which addresses the high computational cost of long-sequence speech recognition, as well as an alternative to the FFN structure. First, the ELU function is adopted as the kernel function of the proposed LA module. Second, we propose a novel Learnable Multiplicative Absolute Position Embedding (LM-APE) based re-weighting mechanism that can reduce the well-known quadratic time and space complexity of softmax self-attention. Third, we use Gated Linear Units (GLU) to substitute the Feed Forward Network (FFN) for better performance. Extensive experiments have been conducted on the public LibriSpeech datasets. Compared to the Conformer model with cosFormer style linear attention, our proposed method can achieve up to 0.63% word-error-rate…
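To make the complexity claim concrete, here is a minimal sketch of kernelized linear attention in plain Python. It assumes the widely used feature map phi(x) = ELU(x) + 1 and the generic linear-attention factorization phi(Q)(phi(K)^T V); it does not include the LM-APE re-weighting, whose details are not described on this page, and all function names are hypothetical. A quadratic reference version is included only to show that the two orderings give the same result.

```python
import math


def elu_plus_one(x):
    # ELU(x) + 1 with alpha = 1: strictly positive for every input.
    return x + 1.0 if x > 0 else math.exp(x)


def feature_map(rows):
    return [[elu_plus_one(x) for x in row] for row in rows]


def linear_attention(q, k, v):
    # Linear in sequence length: accumulate phi(K)^T V and sum(phi(K)) once,
    # then reuse both summaries for every query position.
    phi_q, phi_k = feature_map(q), feature_map(k)
    d, d_v = len(phi_k[0]), len(v[0])
    kv = [[0.0] * d_v for _ in range(d)]
    k_sum = [0.0] * d
    for k_row, v_row in zip(phi_k, v):
        for a in range(d):
            k_sum[a] += k_row[a]
            for b in range(d_v):
                kv[a][b] += k_row[a] * v_row[b]
    out = []
    for q_row in phi_q:
        norm = sum(q_row[a] * k_sum[a] for a in range(d))
        out.append([sum(q_row[a] * kv[a][b] for a in range(d)) / norm
                    for b in range(d_v)])
    return out


def quadratic_attention(q, k, v):
    # Reference: builds every pairwise weight explicitly, O(n^2) in length.
    phi_q, phi_k = feature_map(q), feature_map(k)
    out = []
    for q_row in phi_q:
        weights = [sum(a * b for a, b in zip(q_row, k_row)) for k_row in phi_k]
        total = sum(weights)
        out.append([sum(weights[j] * v[j][b] for j in range(len(v))) / total
                    for b in range(len(v[0]))])
    return out
```

Because the feature map is strictly positive, the normalizer is never zero, and the linear and quadratic versions agree up to floating-point error. The saving comes purely from reordering the matrix products so that no n-by-n attention matrix is ever formed.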

Code & Models

Repositories

yygle/LMLA (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Exponential Linear Unit · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax