Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

Ching-Feng Yeh; Jay Mahadeokar; Kaustubh Kalgaonkar; Yongqiang Wang; Duc Le; Mahaveer Jain; Kjell Schubert; Christian Fuegen; Michael L. Seltzer

arXiv:1910.12977·eess.AS·October 30, 2019·66 cites

TL;DR

This paper introduces a Transformer-based neural transducer for end-to-end speech recognition, leveraging self-attention, causal convolution, and truncated attention to improve accuracy, efficiency, and streaming capability on LibriSpeech.

Contribution

It proposes a Transformer-Transducer model that combines VGGNet-style causal convolution with truncated self-attention, enabling efficient streaming speech recognition with higher accuracy than LSTM/BLSTM-based neural transducers on LibriSpeech.
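The idea behind causal convolution can be illustrated with a minimal sketch. This is a hypothetical 1-D example over the time axis only, not the paper's VGGNet front end; the function name `causal_conv1d` and the toy kernel are assumptions made for illustration. The input is padded on the left only, so each output frame depends on the current and past frames and never on future ones.

```python
def causal_conv1d(x, kernel):
    """Causal 1-D convolution: y[t] = sum_k kernel[k] * x[t - k].

    Frames before the start of the sequence are treated as zeros,
    so y[t] never depends on x[s] for s > t.
    """
    K = len(kernel)
    padded = [0.0] * (K - 1) + list(x)  # left padding only
    return [
        sum(kernel[k] * padded[t + K - 1 - k] for k in range(K))
        for t in range(len(x))
    ]


# Toy input: four frames, kernel of width 2.
y = causal_conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 1.0])
print(y)  # [1.0, 3.0, 5.0, 7.0]
```

Because padding is applied only on the left, changing a later frame leaves all earlier outputs unchanged, which is the property a streaming recognizer needs.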

Findings

01

Achieved 6.37% WER on LibriSpeech test-clean

02

Reduced computational complexity to O(T) using truncated self-attention

03

Maintained streaming capability with a compact model of 45.7M parameters

Abstract

We explore options to use Transformer networks in neural transducers for end-to-end speech recognition. Transformer networks use self-attention for sequence modeling and come with advantages in parallel computation and capturing contexts. We propose 1) using VGGNet with causal convolution to incorporate positional information and reduce frame rate for efficient inference, and 2) using truncated self-attention to enable streaming for Transformer and reduce computational complexity. All experiments are conducted on the public LibriSpeech corpus. The proposed Transformer-Transducer outperforms neural transducers with LSTM/BLSTM networks and achieves word error rates of 6.37% on the test-clean set and 15.30% on the test-other set, while remaining streamable, compact with 45.7M parameters for the entire system, and computationally efficient with complexity of O(T), where T is input sequence…
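Truncated self-attention, as described in the abstract, can be sketched as follows. This is a simplified single-head example with no learned projections, under assumed names (`truncated_attention_mask`, `masked_self_attention`) and assumed window sizes `left` and `right`; it is not the authors' implementation. Each frame attends only to a fixed number of neighbouring frames, so the number of attended pairs grows linearly in T rather than quadratically.

```python
import math


def truncated_attention_mask(T, left, right):
    """mask[t][s] is True when frame t may attend to frame s."""
    return [[-left <= s - t <= right for s in range(T)] for t in range(T)]


def masked_self_attention(x, mask):
    """Scaled dot-product self-attention restricted to the masked window.

    x is a list of T feature vectors. Scores are computed only for visible
    frames, so the work per frame is bounded by the window size.
    """
    d = len(x[0])
    out = []
    for t, query in enumerate(x):
        visible = [s for s, ok in enumerate(mask[t]) if ok]
        scores = [
            sum(a * b for a, b in zip(query, x[s])) / math.sqrt(d)
            for s in visible
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(v - top) for v in scores]
        total = sum(exps)
        out.append([
            sum(e / total * x[s][j] for e, s in zip(exps, visible))
            for j in range(d)
        ])
    return out


# Hypothetical sizes: 8 frames, look back 2 frames, look ahead 1 frame.
T, left, right = 8, 2, 1
mask = truncated_attention_mask(T, left, right)
x = [[math.sin(t + j) for j in range(4)] for t in range(T)]
out = masked_self_attention(x, mask)

# Attended pairs are bounded by T * (left + right + 1), i.e. linear in T.
pairs = sum(sum(row) for row in mask)
print(pairs, "<=", T * (left + right + 1))
```

With full self-attention the same sequence would require T * T attended pairs. Setting `right` to a small fixed value also bounds the look-ahead, which is what keeps the model streamable.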

Code & Models

Repositories

msalhab96/SpeeQ (PyTorch)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax