Similarity and Content-based Phonetic Self Attention for Speech Recognition

Kyuhong Shim; Wonyong Sung

arXiv:2203.10252·cs.CL·July 13, 2022

Open Access

TL;DR

This paper introduces a novel phonetic self-attention mechanism combining similarity and content-based attention to enhance speech recognition by capturing more representative phonetic features without increasing latency or model size.

Contribution

The paper proposes a new phonetic self-attention method that improves speech recognition by explicitly modeling phonetic features through dual attention types, a novel approach compared to standard self-attention.

Findings

1. Improved phoneme classification accuracy.
2. Enhanced speech recognition performance.
3. No increase in latency or parameter size.

Abstract

Transformer-based speech recognition models have achieved great success due to the self-attention (SA) mechanism, which utilizes every frame in the feature extraction process. In particular, SA heads in lower layers capture various phonetic characteristics through the query-key dot product, which is designed to compute the pairwise relationship between frames. In this paper, we propose a variant of SA to extract more representative phonetic features. The proposed phonetic self-attention (phSA) is composed of two different types of phonetic attention: one is similarity-based and the other is content-based. In short, similarity-based attention captures the correlation between frames, while content-based attention considers each frame on its own, without being affected by other frames. We identify which parts of the original dot product equation are related to the two different attention patterns and improve…
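The split between similarity-based and content-based terms can be illustrated numerically. The following is a minimal sketch, not the paper's implementation. It assumes the query and key projections are linear layers with bias terms, which the abstract does not state. All names (`W_q`, `W_k`, `b_q`, `b_k`) and sizes are hypothetical, and the usual scaling by the square root of the head dimension is omitted. Under that assumption the dot product expands into four terms: one that depends on both frames (similarity-like), one that depends only on the key frame (content-like), one that depends only on the query frame, and a constant.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_head = 5, 8, 4  # hypothetical sizes: frames, model dim, head dim

X = rng.normal(size=(T, d_model))         # one row per speech frame
W_q = rng.normal(size=(d_model, d_head))  # assumed query projection
W_k = rng.normal(size=(d_model, d_head))  # assumed key projection
b_q = rng.normal(size=d_head)             # assumed query bias
b_k = rng.normal(size=d_head)             # assumed key bias

# Standard (unscaled) query-key dot product, scores[i, j] for query i, key j.
Q = X @ W_q + b_q
K = X @ W_k + b_k
scores = Q @ K.T

# Expand (x_i W_q + b_q) . (x_j W_k + b_k) into four terms.
similarity = (X @ W_q) @ (X @ W_k).T  # depends on both frames i and j
content = (X @ W_k) @ b_q             # depends only on the key frame j
query_only = (X @ W_q) @ b_k          # depends only on the query frame i
constant = b_q @ b_k                  # same for every (i, j)

recombined = similarity + content[None, :] + query_only[:, None] + constant
decomposition_holds = np.allclose(scores, recombined)


def softmax(z):
    """Row-wise softmax over keys, shifted for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Terms that are constant along each row cancel inside the softmax,
# so only the similarity and content terms shape the attention weights.
softmax_unchanged = np.allclose(
    softmax(scores), softmax(similarity + content[None, :])
)
```

In this sketch both `decomposition_holds` and `softmax_unchanged` evaluate to `True`, so the attention weights are determined by the pairwise term and the key-only term alone.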

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Content-based Attention