HOPE for a Robust Parameterization of Long-memory State Space Models
Annan Yu, Michael W. Mahoney, N. Benjamin Erichson

TL;DR
This paper introduces HOPE, a novel Hankel operator-based parameterization for linear time-invariant state-space models, enhancing initialization, training stability, and long-memory capabilities for sequence learning tasks.
Contribution
The paper develops a new HOPE parameterization scheme for LTI systems using Hankel operators, improving robustness and efficiency over traditional methods.
Findings
Enhanced performance on Long-Range Arena tasks.
Improved training stability and initialization.
Demonstrated non-decaying memory in experiments.
Abstract
State-space models (SSMs) that utilize linear, time-invariant (LTI) systems are known for their effectiveness in learning long sequences. To achieve state-of-the-art performance, an SSM often needs a specifically designed initialization, and the training of state matrices is on a logarithmic scale with a very small learning rate. To understand these choices from a unified perspective, we view SSMs through the lens of Hankel operator theory. Building upon it, we develop a new parameterization scheme, called HOPE, for LTI systems that utilizes Markov parameters within Hankel operators. Our approach helps improve the initialization and training stability, leading to a more robust parameterization. We efficiently implement these innovations by nonuniformly sampling the transfer functions of LTI systems, and they require fewer parameters compared to canonical SSMs. When benchmarked against…
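To make the objects named in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It computes the Markov parameters h_k = C A^k B of a small discrete-time LTI system, builds a finite section of the Hankel matrix H[i, j] = h[i + j], and applies the system by sampling its transfer function G(z) = sum_k h_k z^{-k} on the unit circle. The function names, the 3-state example system, and the use of uniformly spaced FFT samples are all assumptions made for illustration. The paper samples the transfer function nonuniformly, and that scheme is not reproduced here.

```python
import numpy as np


def markov_parameters(A, B, C, n):
    """Return h_k = C A^k B for k = 0, ..., n-1 (single input, single output)."""
    h = np.empty(n)
    x = np.asarray(B, dtype=float).copy()
    for k in range(n):
        h[k] = C @ x  # at this point x equals A^k B
        x = A @ x
    return h


def hankel_matrix(h, m):
    """Return the m x m finite section H[i, j] = h[i + j] of the Hankel operator."""
    h = np.asarray(h)
    if len(h) < 2 * m - 1:
        raise ValueError("need at least 2*m - 1 Markov parameters")
    idx = np.add.outer(np.arange(m), np.arange(m))  # idx[i, j] = i + j
    return h[idx]


def apply_via_transfer_function(h, u):
    """Causal convolution y = h * u, computed in the frequency domain.

    The FFT of h gives samples of G(z) = sum_k h_k z^{-k} at uniformly
    spaced points on the unit circle (an assumption of this sketch).
    """
    L = len(u)
    n_fft = L + len(h) - 1  # zero-pad so circular convolution equals linear
    G = np.fft.rfft(h, n=n_fft)
    U = np.fft.rfft(u, n=n_fft)
    return np.fft.irfft(G * U, n=n_fft)[:L]


# Hypothetical 3-state system with a stable diagonal state matrix.
A = np.diag([0.9, 0.5, -0.3])
B = np.ones(3)
C = np.ones(3)

h = markov_parameters(A, B, C, 64)
H = hankel_matrix(h, 6)
u = np.random.default_rng(0).standard_normal(64)
y = apply_via_transfer_function(h, u)
```

In this sketch the sequence `h` alone determines the input-output map, so it can serve as the trainable parameters in place of (A, B, C). The rank of `H` is bounded by the state dimension, which is the link between Markov parameters and the underlying state-space realization.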
Peer Reviews
Decision·ICLR 2025 Poster
This paper proposes a theoretical framework for analyzing SSMs, supported by many graphs and experiments. The plots are well-executed, and the LRA experiments include a good set of baselines. I also appreciate the noise-padded sCIFAR-10 experiment as an ablation. The algorithm is clearly written.
1. Motivation: random init in models like Mamba also works, so the initialization issues seem to be limited to certain SSMs. 2. HOPE-SSM does not have non-decaying memory: these are asymptotically stable systems, instead of marginally stable systems, so they cannot have arbitrarily long memory. This point is not clarified upfront in the paper. 3. HOPE-SSM on Path-X, which is the task with the longest memory in LRA, does not outperform S5. This raises some concerns on the long-memory capability o
1. I believe HOPE is very well motivated by Section 3. The analysis presented in that section clearly illustrates all the issues with current methods based on standard parametrization. In a nutshell, Figures 2 and 3 perfectly summarize the entire section. 2. The three advantages of HOPE are well explained and essential. It is impressive that the authors managed to achieve all three benefits with a simple new parametrization. 3. The entire framework seems very simple, yet powerful.
Some results lack sufficient motivation. For example, the setup introduced before Theorem 1 could be better explained—why is it reasonable to sample entries of $A$ from $F_a$ and entries of $\overline{B}\circ\overline{C}^\top$ from a standard normal distribution? Similarly, the discussion preceding Algorithm 1, as well as the algorithm itself, appears more complicated than necessary, with essentially the same information repeated three times. I would prefer having only Algorithm 1 with the full
1. Strong empirical and theoretical evidence in favour of the proposed parameterisation. 2. Clearly written. 3. The algorithm is novel, although SSMs working in the frequency domain have been proposed by Agarwal et al. (2023).
1. Some limitations are not experimentally investigated. In particular, the method seems to weigh the input time steps belonging to a fixed window equally and discard the others. This could be a limitation, but none of the tasks considered in the experiments seem to benefit from a recency bias (see also the first question). 2. Limited discussion of related work (minor).
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Memory and Neural Computing · Distributed systems and fault tolerance
Methods: High-Order Proximity preserved Embedding
