Context-Aware Transformer Transducer for Speech Recognition
Feng-Ju Chang, Jing Liu, Martin Radfar, Athanasios Mouchtaris, Maurizio Omologo, Ariya Rastrow, Siegfried Kunzmann

TL;DR
This paper introduces a novel context-aware transformer transducer (CATT) for speech recognition that leverages contextual signals, including BERT-based encoding, to significantly improve recognition accuracy, especially for rare words.
Contribution
The paper proposes a multi-head attention-based context-biasing network integrated into a transformer transducer, utilizing BERT and BLSTM encoders for enhanced contextual information handling.
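As a rough illustration of the idea, the sketch below shows multi-head cross-attention in which audio frames act as queries and context-phrase embeddings act as keys and values. It is not the authors' implementation: the function name `context_biasing`, the projection matrices in `params`, the random inputs, and the residual addition used to merge the attended context back into the audio representation are all assumptions made for this example. In the paper the context embeddings would come from a BLSTM or BERT encoder, not from random vectors.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def context_biasing(audio, context, params, num_heads):
    """Hypothetical context-biasing layer.

    audio:   (T, d) audio-encoder frames, used as attention queries
    context: (K, d) context-phrase embeddings, used as keys and values
    params:  dict of (d, d) projection matrices "Wq", "Wk", "Wv", "Wo"
    """
    T, d = audio.shape
    K = context.shape[0]
    d_head = d // num_heads

    def split_heads(x, n):
        # (n, d) -> (num_heads, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(audio @ params["Wq"], T)
    k = split_heads(context @ params["Wk"], K)
    v = split_heads(context @ params["Wv"], K)

    # Scaled dot-product attention, computed per head: (H, T, K)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)

    # Merge the heads back to (T, d) and apply the output projection.
    heads = weights @ v
    attended = heads.transpose(1, 0, 2).reshape(T, d) @ params["Wo"]

    # Assumption: combine with the audio frames by residual addition.
    biased = audio + attended
    return biased, weights


rng = np.random.default_rng(0)
d, T, K, H = 8, 5, 3, 2
params = {name: rng.normal(scale=0.1, size=(d, d))
          for name in ("Wq", "Wk", "Wv", "Wo")}
audio = rng.normal(size=(T, d))      # stand-in for audio encoder output
context = rng.normal(size=(K, d))    # stand-in for encoded context phrases
biased, weights = context_biasing(audio, context, params, H)
print(biased.shape, weights.shape)   # (5, 8) (2, 5, 3)
```

Each row of `weights` sums to 1 over the K context phrases, so it can be read as how strongly a given audio frame attends to each phrase within one head.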
Findings
CATT with BERT encoder reduces word error rate significantly.
Outperforms existing deep contextual models by 24.2% and 19.4%.
Improves recognition of rare words in speech recognition.
Abstract
End-to-end (E2E) automatic speech recognition (ASR) systems often have difficulty recognizing uncommon words that appear infrequently in the training data. One promising method to improve the recognition accuracy on such rare words is to latch onto personalized/contextual information at inference. In this work, we present a novel context-aware transformer transducer (CATT) network that improves the state-of-the-art transformer-based ASR system by taking advantage of such contextual signals. Specifically, we propose a multi-head attention-based context-biasing network, which is jointly trained with the rest of the ASR sub-networks. We explore different techniques to encode contextual data and to create the final attention context vectors. We also leverage both BLSTM and pretrained BERT-based models to encode contextual data and guide the network training. Using an in-house far-field…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Linear Warmup With Linear Decay · Layer Normalization · Adam · Attention Dropout · WordPiece · Dropout
