Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

Qian Zhang; Han Lu; Hasim Sak; Anshuman Tripathi; Erik McDermott; Stephen Koo; Shankar Kumar

arXiv:2002.02562·eess.AS·February 18, 2020·27 cites

TL;DR

This paper introduces a streaming speech recognition model using Transformer encoders with RNN-T loss, achieving high accuracy with limited future context and demonstrating the effectiveness of self-attention in real-time applications.

Contribution

The paper presents a novel Transformer-based streaming speech recognition model trained with RNN-T loss, outperforming previous methods on LibriSpeech benchmarks.

Findings

01. Limiting the left context in self-attention keeps streaming decoding computationally tractable, with only a slight loss in accuracy.

02. The full-attention version of the model surpasses the previous state-of-the-art accuracy on LibriSpeech.

03. Attending to a few future frames brings accuracy close to that of full attention.
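
The limited-context findings can be pictured as a banded self-attention mask. The following is a minimal sketch, not the authors' implementation: it assumes each frame may attend to at most `left` past frames and `right` future frames, and the function names and sizes are hypothetical choices made for illustration.

```python
def limited_context_mask(num_frames, left, right):
    """Return mask[t][s], True when frame t may attend to frame s.

    Assumption for illustration: the window is [t - left, t + right],
    clipped to the sequence boundaries.
    """
    return [
        [t - left <= s <= t + right for s in range(num_frames)]
        for t in range(num_frames)
    ]


def attended_counts(mask):
    """Number of frames each query position attends to."""
    return [sum(row) for row in mask]


# Illustrative sizes only: 6 frames, 2 frames of left context, 1 future frame.
mask = limited_context_mask(num_frames=6, left=2, right=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under this assumption each row has at most `left + right + 1` allowed positions, so per-frame attention cost no longer grows with utterance length. Setting `right=0` gives a strictly causal mask, while a small positive `right` corresponds to looking ahead a few future frames.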

Abstract

In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently. The activations from both audio and label encoders are combined with a feed-forward layer to compute a probability distribution over the label space for every combination of acoustic frame position and label history. This is similar to the Recurrent Neural Network Transducer (RNN-T) model, which uses RNNs for information encoding instead of Transformer encoders. The model is trained with the RNN-T loss well-suited to streaming decoding. We present results on the LibriSpeech dataset showing that limiting the left context for self-attention in the Transformer layers makes decoding computationally tractable for streaming,…
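
The combination step described in the abstract, one label distribution for every pair of acoustic frame position and label history, can be sketched in plain Python. This is a hedged illustration only: the abstract says the two encoder outputs are combined with a feed-forward layer, and the specific form used here (two linear projections, a sum, `tanh`, then an output projection and softmax) is an assumption, as are all names, dimensions, and the random stand-in encoder outputs.

```python
import math
import random


def linear(weights, vec):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_distribution(audio_enc, label_enc, w_audio, w_label, w_out):
    """Return probs[t][u], a distribution over labels for each (t, u) pair.

    Assumed joint form: softmax(w_out @ tanh(w_audio @ a_t + w_label @ l_u)).
    """
    lattice = []
    for audio_vec in audio_enc:
        row = []
        for label_vec in label_enc:
            projected = zip(linear(w_audio, audio_vec), linear(w_label, label_vec))
            hidden = [math.tanh(a + b) for a, b in projected]
            row.append(softmax(linear(w_out, hidden)))
        lattice.append(row)
    return lattice


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: T frames, U label-history positions, encoder dim D,
# joint hidden dim H, and V labels (in a transducer this includes blank).
rng = random.Random(0)
T, U, D, H, V = 4, 3, 5, 6, 7
audio_enc = random_matrix(T, D, rng)  # stand-in for audio encoder outputs
label_enc = random_matrix(U, D, rng)  # stand-in for label encoder outputs
probs = joint_distribution(
    audio_enc,
    label_enc,
    random_matrix(H, D, rng),
    random_matrix(H, D, rng),
    random_matrix(V, H, rng),
)
print(len(probs), len(probs[0]), len(probs[0][0]))  # prints "4 3 7"
```

The resulting T x U grid of distributions is the kind of lattice an RNN-T style loss sums over. Because the two encoders run independently, the audio side can be computed frame by frame as audio arrives.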

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax