State Space Models are Provably Comparable to Transformers in Dynamic Token Selection
Naoki Nishikawa, Taiji Suzuki

TL;DR
This paper demonstrates that state space models combined with nonlinear layers are theoretically comparable to Transformers in token selection and function estimation, offering a computationally efficient alternative.
Contribution
The paper provides the first theoretical analysis showing that SSMs combined with nonlinear layers are comparable to Transformers in selecting the essential tokens of an input sequence and in function estimation.
Findings
SSMs combined with nonlinear layers can efficiently solve two synthetic tasks that are challenging for a single SSM layer.
SSMs combined with nonlinear layers are theoretically comparable to Transformers in nonparametric regression (function estimation).
SSMs offer a computationally efficient alternative to Transformers.
Abstract
Deep neural networks based on state space models (SSMs) are attracting significant attention in sequence modeling since their computational cost is much smaller than that of Transformers. While the capabilities of SSMs have been demonstrated through experiments in various tasks, theoretical understanding of SSMs is still limited. In particular, most theoretical studies discuss the capabilities of SSM layers without nonlinear layers, and there is a lack of discussion on their combination with nonlinear layers. In this paper, we explore the capabilities of SSMs combined with fully connected neural networks, and show that they are comparable to Transformers in extracting the essential tokens depending on the input. As concrete examples, we consider two synthetic tasks, which are challenging for a single SSM layer, and demonstrate that SSMs combined with nonlinear layers can efficiently…
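To make the architecture discussed in the abstract concrete, the following is a minimal sketch of one SSM layer followed by a fully connected network applied to each token. It is an illustration under stated assumptions, not the construction analyzed in the paper: the diagonal state transition, the ReLU activation, the parameter shapes, and all names (`ssm_layer`, `fnn`, `ssm_block`, `init_params`) are hypothetical choices made for this example.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def ssm_layer(u, a, B, C):
    """Linear SSM scan with a diagonal transition (an assumption of this sketch):
    x_t = a * x_{t-1} + B u_t,  y_t = C x_t.
    One pass over the sequence, so the cost is linear in its length."""
    x = [0.0] * len(a)
    ys = []
    for u_t in u:
        Bu = matvec(B, u_t)
        x = [a_i * x_i + bu_i for a_i, x_i, bu_i in zip(a, x, Bu)]
        ys.append(matvec(C, x))
    return ys


def fnn(h, W1, b1, W2, b2):
    """Two-layer fully connected network with ReLU, applied to one token."""
    hidden = [max(z + b, 0.0) for z, b in zip(matvec(W1, h), b1)]
    return [z + b for z, b in zip(matvec(W2, hidden), b2)]


def ssm_block(u, p):
    """SSM layer (mixes information across tokens), then the nonlinear
    layer applied independently at every position."""
    ys = ssm_layer(u, p["a"], p["B"], p["C"])
    return [fnn(y, p["W1"], p["b1"], p["W2"], p["b2"]) for y in ys]


def init_params(d, n, width, seed=0):
    """Random parameters for illustration only (hypothetical initialization)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0, 1) / cols ** 0.5 for _ in range(cols)]
                for _ in range(rows)]

    return {
        "a": [rng.uniform(0.5, 0.99) for _ in range(n)],  # stable decay rates
        "B": mat(n, d), "C": mat(d, n),
        "W1": mat(width, d), "b1": [0.0] * width,
        "W2": mat(d, width), "b2": [0.0] * d,
    }


if __name__ == "__main__":
    params = init_params(d=3, n=4, width=8)
    tokens = [[float(t + i) for i in range(3)] for t in range(6)]
    outputs = ssm_block(tokens, params)
    print(len(outputs), len(outputs[0]))
```

In this sketch the SSM scan is the only component that moves information between positions, and it does so with fixed, input-independent weights. Any input-dependent behavior therefore has to come from the nonlinear layer, which is the combination the abstract focuses on.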
Taxonomy
Topics: Neural Networks and Applications · Fault Detection and Control Systems · Control Systems and Identification
