Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers
Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux

TL;DR
This paper improves long-context end-to-end speech recognition by integrating the Conformer architecture, introducing a novel decoding acceleration technique, and enabling streaming, resulting in state-of-the-art accuracy and faster decoding.
Contribution
It extends the authors' prior context-expanded Transformer with three additions: the Conformer architecture, a new decoding acceleration method, and streaming capability.
Findings
Achieved a 17.3% character error rate (CER) on the HKUST dataset.
Reduced decoding time by over 50%.
Enabled streaming ASR with minimal accuracy loss.
Abstract
This paper addresses end-to-end automatic speech recognition (ASR) for long audio recordings such as lecture and conversational speeches. Most end-to-end ASR models are designed to recognize independent utterances, but contextual information (e.g., speaker or topic) over multiple utterances is known to be useful for ASR. In our prior work, we proposed a context-expanded Transformer that accepts multiple consecutive utterances at the same time and predicts an output sequence for the last utterance, achieving 5-15% relative error reduction from utterance-based baselines in lecture and conversational ASR benchmarks. Although the results have shown remarkable performance gain, there is still potential to further improve the model architecture and the decoding process. In this paper, we extend our prior work by (1) introducing the Conformer architecture to further improve the accuracy, (2)…
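The context-expansion idea described in the abstract (feed several consecutive utterances, predict only the last one) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `expand_context`, the `max_frames` budget, and the rule of adding whole previous utterances while they fit are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # one acoustic feature vector (hypothetical representation)


def expand_context(
    utterances: List[List[Frame]], max_frames: int
) -> List[Tuple[List[Frame], int]]:
    """Build one context-expanded input per utterance.

    Each result is (frames, target_offset): `frames` is the concatenation of
    as many preceding utterances as fit in `max_frames` plus the target
    utterance, and `target_offset` is the index where the target begins.
    The frame budget and whole-utterance rule are assumptions of this sketch.
    """
    expanded = []
    for i, target in enumerate(utterances):
        start = i
        total = len(target)  # the target utterance is always included
        # Walk backward, adding whole previous utterances while they fit.
        while start > 0 and total + len(utterances[start - 1]) <= max_frames:
            start -= 1
            total += len(utterances[start])
        frames = [f for utt in utterances[start : i + 1] for f in utt]
        expanded.append((frames, total - len(target)))
    return expanded


# Three toy utterances of 3, 2, and 4 one-dimensional frames.
utts = [[[0.0]] * 3, [[1.0]] * 2, [[2.0]] * 4]
windows = expand_context(utts, max_frames=6)
```

In this sketch a model would consume `frames` as a single input sequence and be trained to emit the transcript of the final utterance only; `target_offset` simply records where that utterance starts.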
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Layer Normalization · Label Smoothing · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Dropout
