DFSMN-SAN with Persistent Memory Model for Automatic Speech Recognition
Zhao You, Dan Su, Jie Chen, Chao Weng, Dong Yu

TL;DR
This paper introduces a speech recognition model that combines DFSMN with self-attention layers and augments the attention with additional memory, so the model can draw on context beyond the current utterance and reduce character error rate (CER) relative to a vanilla SAN.
Contribution
It proposes an architecture that integrates DFSMN with self-attention layers, and it proposes and compares two additional memory structures added to those layers, improving recognition performance over a purely self-attention network.
Findings
DFSMN-SAN outperforms a vanilla SAN by about 5% relative CER.
The additional memory structures yield a further 5-11% relative CER improvement.
The model achieves state-of-the-art results on large-scale LVCSR tasks.
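As a small worked illustration of how such percentages are usually read, the sketch below computes a relative CER reduction. It assumes the reported gains are relative (not absolute) reductions. The CER values are hypothetical placeholders chosen only to make the arithmetic visible; they are not results from the paper.

```python
def relative_cer_reduction(baseline_cer: float, new_cer: float) -> float:
    """Return the relative CER reduction as a fraction of the baseline CER."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Hypothetical numbers, NOT taken from the paper.
baseline = 10.0   # baseline CER in percent
improved = 9.5    # CER in percent after a model change
rel = relative_cer_reduction(baseline, improved)
print(f"relative reduction: {rel:.1%}")
```

Under this reading, a 5% relative improvement on a 10.0% baseline is a 0.5-point absolute drop, not a drop to 5.0%.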
Abstract
Self-attention networks (SAN) have been introduced into automatic speech recognition (ASR) and have achieved state-of-the-art performance owing to their superior ability to capture long-term dependencies. One of the key ingredients is the self-attention mechanism, which can be performed effectively at the whole-utterance level. In this paper, we investigate whether even more information, beyond the whole-utterance level, can be exploited to further benefit recognition. We propose to apply a self-attention layer with augmented memory to ASR. Specifically, we first propose a variant model architecture that combines a deep feed-forward sequential memory network (DFSMN) with self-attention layers to form a better baseline model than a purely self-attention network. Then, we propose and compare two kinds of additional memory structures added into the self-attention layers. Experiments on large-scale LVCSR…
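The abstract does not spell out the two memory structures, so the following is only a minimal sketch of one common way to give a self-attention layer extra memory: a set of memory slots is concatenated to the keys and values, so every frame can attend to the utterance and to the memory. The function name `self_attention_with_memory`, the parameters `mem_k` and `mem_v`, the single-head layout, and the random initialization are all assumptions made for illustration; in a trained model the memory slots would be learned, and the authors' actual design may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_with_memory(x, w_q, w_k, w_v, mem_k, mem_v):
    """Single-head self-attention with extra memory slots (illustrative only).

    x:            (T, d) frame features for one utterance
    w_q/w_k/w_v:  (d, d) projection matrices
    mem_k, mem_v: (M, d) memory keys and values (hypothetical names)
    Returns the output (T, d) and the attention weights (T, T + M).
    """
    q = x @ w_q
    # Keys and values cover the T utterance frames plus the M memory slots.
    k = np.concatenate([x @ w_k, mem_k], axis=0)
    v = np.concatenate([x @ w_v, mem_v], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Toy usage with random values standing in for trained parameters.
rng = np.random.default_rng(0)
T, d, M = 6, 8, 4
x = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
mem_k = rng.standard_normal((M, d))
mem_v = rng.standard_normal((M, d))
out, attn = self_attention_with_memory(x, w_q, w_k, w_v, mem_k, mem_v)
print(out.shape, attn.shape)
```

The last `M` columns of `attn` show how much weight each frame places on the memory slots rather than on frames of the utterance itself.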
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Test · Memory Network
