Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers

Takaaki Hori; Niko Moritz; Chiori Hori; Jonathan Le Roux

arXiv:2104.09426 · cs.CL · April 20, 2021 · 1 citation

TL;DR

This paper improves long-context end-to-end speech recognition by integrating the Conformer architecture, introducing a novel decoding acceleration technique, and enabling streaming decoding, which yields state-of-the-art accuracy and faster decoding.

Contribution

The paper extends the authors' prior context-expanded Transformer in three ways: it adopts the Conformer architecture, adds a new method for accelerating decoding, and enables streaming decoding.

Findings

01 Achieved a 17.3% character error rate (CER) on the HKUST dataset.

02 Reduced decoding time by more than 50%.

03 Enabled streaming ASR with minimal accuracy loss.
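
For readers unfamiliar with the metric in the first finding, character error rate is conventionally the character-level edit distance between the reference and the hypothesis, divided by the reference length. The following is a minimal sketch of that standard definition, not the scoring tool used in the paper; the function names `edit_distance` and `cer` are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of r
                curr[j - 1] + 1,         # insertion of h
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a reported CER of 17.3% corresponds to roughly 17 character errors per 100 reference characters.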

Abstract

This paper addresses end-to-end automatic speech recognition (ASR) for long audio recordings such as lecture and conversational speeches. Most end-to-end ASR models are designed to recognize independent utterances, but contextual information (e.g., speaker or topic) over multiple utterances is known to be useful for ASR. In our prior work, we proposed a context-expanded Transformer that accepts multiple consecutive utterances at the same time and predicts an output sequence for the last utterance, achieving 5-15% relative error reduction from utterance-based baselines in lecture and conversational ASR benchmarks. Although the results have shown remarkable performance gain, there is still potential to further improve the model architecture and the decoding process. In this paper, we extend our prior work by (1) introducing the Conformer architecture to further improve the accuracy, (2)…
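
The context-expansion idea described in the abstract (feed several consecutive utterances, predict output only for the last one) can be sketched as a sliding window over a recording's utterances. This is a minimal illustration and not the authors' implementation: `build_context_windows` and `max_context` are hypothetical names, strings stand in for acoustic feature sequences, and the window of two preceding utterances is an arbitrary assumption.

```python
from typing import List, Sequence, Tuple


def build_context_windows(
    utterances: Sequence[str], max_context: int = 2
) -> List[Tuple[List[str], str]]:
    """Pair each utterance with up to `max_context` preceding utterances.

    Each returned item is (context, target): the model would receive
    context + [target] as input and be trained or decoded on target only.
    """
    windows = []
    for i, target in enumerate(utterances):
        start = max(0, i - max_context)  # early utterances get a shorter context
        context = list(utterances[start:i])
        windows.append((context, target))
    return windows


# Hypothetical placeholder utterances from one long recording.
recording = ["utt1", "utt2", "utt3", "utt4"]
for context, target in build_context_windows(recording):
    print(context, "->", target)
```

The first utterance has no context and is handled like an ordinary utterance-level input; later utterances carry the preceding ones, which is where speaker and topic information can come from.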

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Layer Normalization · Label Smoothing · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Dropout