Analysis of memory in LSTM-RNNs for source separation
Jeroen Zegers, Hugo Van hamme

TL;DR
This paper investigates what information LSTM-RNNs retain over time in speech separation tasks. Short-term linguistic information is held only briefly, while speaker characteristics persist for longer; deeper layers and bidirectional models improve performance.
Contribution
The paper introduces a memory reset approach that evaluates network performance as a function of the allowed memory time span, giving insight into what LSTM-RNNs store, and for how long, in speech processing.
Findings
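Short-term linguistic processes (shorter than 100 ms) have a strong effect on performance.
Only speaker characteristics are kept in memory for longer than 400 ms.
Performance-wise, it suffices to provide longer memory in the deeper layers.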
Abstract
Long short-term memory recurrent neural networks (LSTM-RNNs) are considered state-of-the-art in many speech processing tasks. The recurrence in the network, in principle, allows any input to be remembered for an indefinite time, a feature very useful for sequential data like speech. However, very little is known about which information is actually stored in the LSTM and for how long. We address this problem by using a memory reset approach which allows us to evaluate network performance depending on the allowed memory time span. We apply this approach to the task of multi-speaker source separation, but it can be used for any task using RNNs. We find a strong performance effect of short-term (shorter than 100 milliseconds) linguistic processes. Only speaker characteristics are kept in the memory for longer than 400 milliseconds. Furthermore, we confirm that performance-wise it is…
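To make the idea of limiting the memory time span concrete, here is a minimal sketch in pure Python. It is an illustration only, not the authors' implementation: it uses a scalar LSTM cell with hand-picked, hypothetical weights (`WEIGHTS`) and a simple periodic reset (`run_with_reset`, `reset_period`), all of which are assumptions introduced here. The abstract does not specify the exact reset scheme used in the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, w):
    """One step of a scalar LSTM cell. w holds hypothetical weights."""
    i = sigmoid(w["wi"] * x + w["ui"] * h + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h + w["bg"])  # candidate cell value
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new


def run_with_reset(xs, w, reset_period=None):
    """Run the cell over xs, zeroing (h, c) every reset_period steps.

    reset_period=None disables the reset (unbounded memory). With a
    periodic reset, the effective memory span at a given step varies
    between 1 and reset_period steps, depending on the last reset.
    """
    h, c = 0.0, 0.0
    outputs = []
    for t, x in enumerate(xs):
        if reset_period is not None and t % reset_period == 0:
            # Wipe the memory: inputs before step t can no longer
            # influence any later output.
            h, c = 0.0, 0.0
        h, c = lstm_step(x, h, c, w)
        outputs.append(h)
    return outputs


# Hypothetical, hand-picked weights (assumption, not trained values).
WEIGHTS = {
    "wi": 0.5, "ui": 0.3, "bi": 0.0,
    "wf": 0.4, "uf": 0.2, "bf": 1.0,
    "wo": 0.6, "uo": 0.1, "bo": 0.0,
    "wg": 1.0, "ug": 0.5, "bg": 0.0,
}

# Two input sequences that differ only in the first three steps.
xs_a = [0.9, -0.4, 0.7, 0.1, -0.8, 0.3]
xs_b = [-0.2, 0.6, -0.5, 0.1, -0.8, 0.3]

# With a reset every 3 steps, outputs from step 3 onward cannot depend
# on the differing prefix.
reset_a = run_with_reset(xs_a, WEIGHTS, reset_period=3)
reset_b = run_with_reset(xs_b, WEIGHTS, reset_period=3)

# Without a reset, the prefix still influences the later outputs.
free_a = run_with_reset(xs_a, WEIGHTS)
free_b = run_with_reset(xs_b, WEIGHTS)
```

Comparing `reset_a[3:]` with `reset_b[3:]` shows identical outputs once the memory has been wiped, whereas `free_a[3:]` and `free_b[3:]` differ. In an evaluation along the lines the abstract describes, one would sweep the allowed span and record task performance at each setting.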
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
