Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory

Chunyang Wu; Yongqiang Wang; Yangyang Shi; Ching-Feng Yeh; Frank Zhang

arXiv:2005.08042 · eess.AS · May 19, 2020 · 1 cites

Open Access

TL;DR

This paper introduces an augmented-memory self-attention mechanism for streaming transformer-based acoustic models. Each step attends to a short segment of the input plus a bank of memories that summarizes the segments already processed, which avoids full-sequence attention and its quadratic cost. On LibriSpeech, the method achieves over 15% relative error reduction against an LC-BLSTM baseline.

Contribution

The paper proposes a novel augmented-memory self-attention method in which a streaming transformer attends to the current short segment and to a memory bank holding embedding information for all previously processed segments, so long sequences can be handled without access to the full input.

Findings

01. Outperforms all existing streamable transformer methods on LibriSpeech

02. Achieves over 15% relative error reduction compared to LC-BLSTM baseline

03. Confirmed effectiveness on large internal datasets
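For readers unfamiliar with the metric in the second finding, relative error reduction is the drop in error rate divided by the baseline error rate. The snippet below is a minimal sketch of that arithmetic; the function name and the example word error rates are hypothetical and are not results from the paper.

```python
def relative_error_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative numbers only (NOT taken from the paper):
# a baseline at 10.0% WER and a new model at 8.5% WER.
reduction = relative_error_reduction(10.0, 8.5)
print(f"relative error reduction: {reduction:.1%}")
```

A 15% relative reduction is therefore a smaller absolute change than 15 percentage points; its size depends on how high the baseline error rate is.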

Abstract

Transformer-based acoustic modeling has achieved great success for both hybrid and sequence-to-sequence speech recognition. However, it requires access to the full sequence, and the computational cost grows quadratically with respect to the input sequence length. These factors limit its adoption for streaming applications. In this work, we proposed a novel augmented memory self-attention, which attends on a short segment of the input sequence and a bank of memories. The memory bank stores the embedding information for all the processed segments. On the librispeech benchmark, our proposed method outperforms all the existing streamable transformer methods by a large margin and achieved over 15% relative error reduction, compared with the widely used LC-BLSTM baseline. Our findings are also confirmed on some large internal datasets.
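The idea in the abstract, attending to a short segment plus a bank of memories, can be illustrated with a small sketch. This is a simplified reading, not the authors' implementation: it uses a single head, omits the learned query/key/value projections, and assumes each memory slot is produced by attending with the mean of the segment's frames as a summary query. The names `attend`, `streaming_attention`, and `segment_len` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def streaming_attention(frames, segment_len):
    """Process frames segment by segment, attending to memory + segment.

    Assumption for this sketch: no learned projections, and one memory
    slot per segment computed from a mean-pooled summary query.
    """
    memory_bank = []  # one embedding per processed segment
    outputs = []
    for start in range(0, len(frames), segment_len):
        segment = frames[start:start + segment_len]
        # Keys/values: memories of earlier segments plus the current segment,
        # never the full input sequence.
        context = memory_bank + segment
        for frame in segment:
            outputs.append(attend(frame, context, context))
        # Summarize the segment and store the result as a new memory slot.
        dim = len(segment[0])
        summary = [sum(f[d] for f in segment) / len(segment) for d in range(dim)]
        memory_bank.append(attend(summary, context, context))
    return outputs, memory_bank


# Toy input: six 2-dimensional frames, processed in segments of two.
frames = [[float(t), 1.0] for t in range(6)]
outputs, memory_bank = streaming_attention(frames, segment_len=2)
print(len(outputs), len(memory_bank))
```

In this sketch each query sees at most `segment_len` frames plus one memory slot per earlier segment, rather than every frame in the utterance, which is what makes the computation usable before the full sequence has arrived.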

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding