Gaussian Kernelized Self-Attention for Long Sequence Data and Its Application to CTC-based Speech Recognition

Yosuke Kashiwagi; Emiru Tsunoo; Shinji Watanabe

arXiv:2102.09168·eess.AS·February 19, 2021

Open Access

TL;DR

This paper introduces a Gaussian kernelized self-attention mechanism for speech recognition that addresses the accuracy degradation conventional self-attention models show on long sequence data.

Contribution

The paper proposes a shift-invariant Gaussian kernelized self-attention architecture with relative position embedding, enhancing long sequence processing in CTC-based speech recognition.

Findings

01. Achieved significant WER reduction from 24.0% to 6.0% on CSJ.

02. Demonstrated improved accuracy on long sequences without windowing.

03. Mathematically linked self-attention to normalized kernel functions.

Abstract

Self-attention (SA) based models have recently achieved significant performance improvements in hybrid and end-to-end automatic speech recognition (ASR) systems owing to their flexible context modeling capability. However, accuracy is known to degrade when SA is applied to long sequence data. This is mainly due to a length mismatch between inference and training data: training data are usually divided into short segments for efficient training. To mitigate this mismatch, we propose a new architecture based on a variant of the Gaussian kernel, which is itself a shift-invariant kernel. First, we mathematically demonstrate that self-attention with shared weight parameters for queries and keys is equivalent to a normalized kernel function. By replacing this kernel function with the proposed Gaussian kernel, the architecture becomes completely shift-invariant…
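The kernel view described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes an identity projection shared by queries and keys, row-wise normalization of the kernel values, and a hypothetical bandwidth parameter `sigma`; the relative position embedding mentioned in the contribution is left out. Under those assumptions, it compares attention weights from an exponentiated dot-product kernel with weights from a Gaussian kernel, and checks how each reacts when every input vector is shifted by the same offset.

```python
import math


def normalize_rows(scores):
    """Divide each row of non-negative kernel values by its row sum."""
    out = []
    for row in scores:
        total = sum(row)
        out.append([v / total for v in row])
    return out


def dot_kernel_attention(x):
    """Softmax attention weights with queries == keys (assumed identity projection)."""
    rows = []
    for q in x:
        logits = [sum(a * b for a, b in zip(q, k)) for k in x]
        m = max(logits)  # subtract the row max for numerical stability
        rows.append([math.exp(v - m) for v in logits])
    return normalize_rows(rows)


def gaussian_kernel_attention(x, sigma=1.0):
    """Row-normalized Gaussian kernel weights; sigma is a hypothetical bandwidth."""
    rows = []
    for q in x:
        sq_dists = [sum((a - b) ** 2 for a, b in zip(q, k)) for k in x]
        rows.append([math.exp(-d / (2.0 * sigma ** 2)) for d in sq_dists])
    return normalize_rows(rows)


def shift(x, offset):
    """Add the same offset vector to every input vector."""
    return [[a + o for a, o in zip(v, offset)] for v in x]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two weight matrices."""
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


# Toy inputs (made up for illustration, not data from the paper).
x = [[0.1, -0.3], [0.5, 0.2], [-0.4, 0.7]]
x_shifted = shift(x, [2.0, -1.0])

gauss_change = max_abs_diff(gaussian_kernel_attention(x),
                            gaussian_kernel_attention(x_shifted))
dot_change = max_abs_diff(dot_kernel_attention(x),
                          dot_kernel_attention(x_shifted))

print("Gaussian kernel weight change:", gauss_change)
print("Dot-product kernel weight change:", dot_change)
```

The Gaussian weights depend only on differences between vectors, so a common shift leaves them unchanged up to floating-point rounding, while the dot-product weights change because the shift adds a key-dependent term that row normalization does not cancel.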

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques