Inverse Approximation Theory for Nonlinear Recurrent Neural Networks
Shida Wang, Zhong Li, Qianxiao Li

TL;DR
This paper establishes an inverse approximation theorem for nonlinear RNNs: any target relationship they can stably approximate must have exponentially decaying memory. The result exposes a fundamental limitation of the architecture, and the authors propose a reparameterization method to address it.
Contribution
It extends the curse of memory concept from linear to nonlinear RNNs and provides a reparameterization approach to mitigate these limitations.
Findings
Nonlinear RNNs can only stably approximate functions with exponentially decaying memory.
Theoretical results demonstrate fundamental limitations of RNNs for long-term dependency learning.
Numerical experiments confirm the theoretical predictions and effectiveness of the proposed reparameterization.
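The exponential-decay claim is easiest to see in the linear case that the paper generalizes. The following is a minimal sketch, not the authors' code: it assumes a discrete-time linear RNN h[k+1] = W h[k] + U x[k], y[k] = c^T h[k], and treats its impulse response c^T W^k U as a stand-in for the memory function. The function name `linear_rnn_memory` and the toy diagonal model are hypothetical choices made for illustration.

```python
import numpy as np

def linear_rnn_memory(W, U, c, steps):
    """Return rho[k] = c^T W^k U for k = 0..steps-1.

    This is the impulse response of the linear RNN
    h[k+1] = W h[k] + U x[k],  y[k] = c^T h[k].
    """
    rho = np.empty(steps)
    v = np.asarray(U, dtype=float).copy()
    for k in range(steps):
        rho[k] = c @ v   # weight given to an input seen k steps ago
        v = W @ v        # push that input one more step through the recurrence
    return rho

# Hypothetical toy model: diagonal recurrent matrix with eigenvalues in (0, 1).
m = 8
eigenvalues = np.linspace(0.1, 0.9, m)
W = np.diag(eigenvalues)
U = np.ones(m)
c = np.ones(m)

steps = 100
rho = linear_rnn_memory(W, U, c, steps)

# For this model the memory is bounded by C * r^k, where r is the
# spectral radius of W and C = m.
bound = m * eigenvalues.max() ** np.arange(steps)
print(np.all(rho <= bound + 1e-12))
```

In this toy setting, keeping information for longer requires eigenvalues closer to 1, which pushes the model toward the stability boundary. That is the tension the findings above describe.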
Abstract
We prove an inverse approximation theorem for the approximation of nonlinear sequence-to-sequence relationships using recurrent neural networks (RNNs). This is a so-called Bernstein-type result in approximation theory, which deduces properties of a target function under the assumption that it can be effectively approximated by a hypothesis space. In particular, we show that nonlinear sequence relationships that can be stably approximated by nonlinear RNNs must have an exponential decaying memory structure - a notion that can be made precise. This extends the previously identified curse of memory in linear RNNs into the general nonlinear setting, and quantifies the essential limitations of the RNN architecture for learning sequential relationships with long-term memory. Based on the analysis, we propose a principled reparameterization method to overcome the limitations. Our theoretical…
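The abstract does not spell out the reparameterization, so the following is only a sketch of the general idea under stated assumptions: the recurrent eigenvalue of a continuous-time unit is written as lambda = -f(w) for an unconstrained weight w, with f positive, so that lambda stays negative (stable) for every w. The two maps shown, exponential and softplus, are common choices assumed here for illustration. The function names and the numbers `target` and `step` are hypothetical.

```python
import numpy as np

def exp_reparam(w):
    """Map an unconstrained weight w to a negative eigenvalue -exp(w)."""
    return -np.exp(w)

def softplus_reparam(w):
    """Map an unconstrained weight w to a negative eigenvalue -log(1 + exp(w))."""
    return -np.logaddexp(0.0, w)

# Hypothetical slow-decay target: an eigenvalue close to the stability boundary 0.
target = -0.01
# Hypothetical perturbation of the trainable weight (e.g. one noisy update).
step = 0.05

# Weights that reproduce the target under each parameterization.
w_direct = target                       # direct: lambda = w
w_exp = np.log(-target)                 # inverse of -exp(w)
w_softplus = np.log(np.expm1(-target))  # inverse of -softplus(w)

# Eigenvalues after the same perturbation is applied to each weight.
lam_direct = w_direct + step
lam_exp = exp_reparam(w_exp + step)
lam_softplus = softplus_reparam(w_softplus + step)

print("direct  :", lam_direct)    # crosses zero, so the unit becomes unstable
print("exp     :", lam_exp)       # remains negative
print("softplus:", lam_softplus)  # remains negative
```

With the direct parameterization, a small perturbation near the boundary flips the sign of the eigenvalue. Under either reparameterization the eigenvalue is negative by construction, whatever value the weight takes.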
Peer Reviews
Decision · ICLR 2024 spotlight
The paper is relatively well written and extends previous theory on linear RNNs. While it is not difficult to show heuristically that a popular RNN, such as an LSTM or GRU, has exponentially decaying memory by approximating the internal state of the RNN with a relaxation equation, this manuscript provides a robust proof of such behavior.
While the manuscript is clearly written with a well-defined problem setup, my concern is its fit to the scope of ICLR. On one hand, I believe that the analysis and definitions provided in the manuscript can be of interest in developing theories about RNNs. On the other hand, the scope of the manuscript is too narrow and it is unclear what insight this study provides to benefit a broader class of RNNs or Deep Learning in general. I believe that the authors need to consider giving more insights f
The paper is well presented and pleasant to read. Extending previously known results to the non-linear case is interesting and brings added value. It gives a theoretical framework to understand the common saying that RNNs are unable to learn long-time dependencies in time series. A modification of the parametrization of RNNs is proposed to remedy this problem. Technically, the paper introduces a new notion of memory that holds for nonlinear functionals and which extends the linear case, as well
I have no strong reservations about the paper. I do have a question about Figure 3, which I did not understand. I am willing to raise the rating should this question be answered. [Update on Nov. 18: the authors clarified this point in their comments below, and I updated my rating accordingly]. I do not understand the filtering part of the experiment. Since we know from the theoretical results that RNNs can only represent exponentially decaying functions, the teacher models should all be ex
As an outsider to the field, the paper introduces/uses formal concepts that I find insightful: - The memory function that is used in the paper mathematically characterizes the intuitive behavior of RNNs. - The notion of stable approximation is an interesting proxy for how a function can be learned by gradient descent which, from my very limited knowledge, seems to be rare in approximation theory. Overall, I found the paper well written, in a way that is accessible to a decently large a
I am not qualified enough to identify the weaknesses of the paper.
Taxonomy
Topics: Neural Networks and Applications · Model Reduction and Neural Networks · Image and Signal Denoising Methods
