Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers

Yusuke Kida; Tatsuya Komatsu; Masahito Togami

arXiv:2104.10328 · eess.AS · April 22, 2021 · 1 citation

Open Access

TL;DR

This paper introduces a label-synchronous speech-to-text alignment method for ASR that uses forward and backward Transformers, reaching a 0.2% alignment error rate on the Japanese CSJ corpus and reducing character error rates by up to 59% when the aligned data is used.

Contribution

The paper presents a label-synchronous alignment approach built on Transformer models that outperforms conventional frame-synchronous methods in both alignment accuracy and ASR performance.

Findings

1. Achieved a 0.2% alignment error rate on the Japanese CSJ corpus.
2. Reduced character error rates by up to 59% using aligned data.
3. Outperformed traditional CTC-based alignment methods.

Abstract

This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR). Speech-to-text alignment is the problem of splitting long audio recordings with unaligned transcripts into utterance-wise pairs of speech and text. Unlike conventional methods based on frame-synchronous prediction, the proposed method redefines speech-to-text alignment as a label-synchronous text mapping problem. This enables accurate alignment that benefits from the strong inference ability of state-of-the-art attention-based encoder-decoder models, which cannot be applied to the conventional methods. Two different Transformer models, named the forward Transformer and the backward Transformer, are respectively used for estimating the initial and final tokens of a given speech segment based on end-of-sentence prediction with teacher-forcing. Experiments using the…
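
The end-of-sentence idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a hypothetical scorer `eos_score(prefix)` that returns how strongly a decoder, teacher-forced with the transcript prefix for one speech segment, predicts end-of-sentence next. The names `pick_boundary`, `eos_score`, `toy_eos_score`, and `max_len` are all illustrative. The boundary is taken as the prefix length with the highest score. The opposite boundary could in principle be found the same way by running a second scorer over the reversed token sequence, which is also an assumption here.

```python
from typing import Callable, Sequence


def pick_boundary(
    tokens: Sequence[str],
    eos_score: Callable[[Sequence[str]], float],
    max_len: int,
) -> int:
    """Return the prefix length n (1..max_len) whose teacher-forced prefix
    tokens[:n] receives the highest end-of-sentence score."""
    limit = min(max_len, len(tokens))
    if limit < 1:
        raise ValueError("need at least one candidate token")
    best_n, best_score = 1, float("-inf")
    for n in range(1, limit + 1):
        # Hypothetical scorer: in a real system this would be the decoder's
        # log-probability of end-of-sentence after consuming tokens[:n].
        score = eos_score(tokens[:n])
        if score > best_score:
            best_n, best_score = n, score
    return best_n


def toy_eos_score(prefix: Sequence[str]) -> float:
    """Toy stand-in for a model: favors prefixes that end with a period."""
    return 0.0 if prefix[-1] == "." else -5.0


# Long unaligned transcript; the speech segment covers only its first utterance.
transcript = ["hello", "world", ".", "next", "utterance", "."]
end = pick_boundary(transcript, toy_eos_score, max_len=5)
segment_text = transcript[:end]
```

Here `max_len` bounds the search window so that only plausible utterance lengths are scored. A real scorer would come from a trained model rather than the punctuation rule used in this toy.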

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Layer Normalization · Label Smoothing · Residual Connection · Byte Pair Encoding