Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory
Chunyang Wu, Yongqiang Wang, Yangyang Shi, Ching-Feng Yeh, Frank Zhang

TL;DR
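This paper introduces an augmented-memory self-attention mechanism for streaming transformer-based acoustic models. Each step attends to a short input segment plus a bank of memories rather than the full sequence, which avoids the quadratic cost of full-sequence attention. On Librispeech, the method outperforms existing streamable transformer methods and achieves over 15% relative error reduction compared with an LC-BLSTM baseline.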
Contribution
The paper proposes a novel augmented-memory self-attention method that allows streaming transformers to handle long sequences efficiently by attending to a memory bank, improving performance over existing methods.
Findings
Outperforms all existing streamable transformer methods on Librispeech
Achieves over 15% relative error reduction compared with the widely used LC-BLSTM baseline
Confirmed effectiveness on large internal datasets
Abstract
Transformer-based acoustic modeling has achieved great success for both hybrid and sequence-to-sequence speech recognition. However, it requires access to the full sequence, and the computational cost grows quadratically with respect to the input sequence length. These factors limit its adoption for streaming applications. In this work, we propose a novel augmented memory self-attention, which attends on a short segment of the input sequence and a bank of memories. The memory bank stores the embedding information for all the processed segments. On the Librispeech benchmark, our proposed method outperforms all the existing streamable transformer methods by a large margin and achieves over 15% relative error reduction, compared with the widely used LC-BLSTM baseline. Our findings are also confirmed on some large internal datasets.
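The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`attend`, `streaming_augmented_attention`) are hypothetical, and it assumes that each memory slot is produced by attending with a mean-pooled summary of the segment; the abstract only says that the bank stores embedding information for processed segments. Learned query/key/value projections, multiple heads, and any look-ahead context are omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys, values):
    # Plain scaled dot-product attention (single head, no projections).
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores) @ values


def streaming_augmented_attention(frames, segment_len):
    """Process `frames` (shape [T, dim]) segment by segment.

    Each segment attends only to itself plus the memory bank, so the cost
    per segment depends on segment_len and the number of memory slots,
    not on the full sequence length T.
    """
    dim = frames.shape[1]
    memory_bank = np.zeros((0, dim))  # one slot per processed segment
    outputs = []
    for start in range(0, len(frames), segment_len):
        segment = frames[start:start + segment_len]
        # Keys/values: memories of earlier segments plus the current segment.
        context = np.concatenate([memory_bank, segment], axis=0)
        outputs.append(attend(segment, context, context))
        # ASSUMPTION: summarize the segment with a mean-pooled query and
        # store the attention output as the new memory slot.
        summary_query = segment.mean(axis=0, keepdims=True)
        new_slot = attend(summary_query, context, context)
        memory_bank = np.concatenate([memory_bank, new_slot], axis=0)
    return np.concatenate(outputs, axis=0), memory_bank
```

Because each segment sees only earlier memories and its own frames, the outputs for already-processed segments do not change when later audio arrives, which is the property a streaming model needs.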
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding
