Graph Signal Processing Meets Mamba2: Adaptive Filter Bank via Delta Modulation
Yehjin Shin, Seojin Kim, Noseong Park

TL;DR
This paper introduces HADES, a GSP-inspired hierarchical adaptive filter bank framework for SSMs like Mamba2, improving efficiency and interpretability while maintaining competitive performance across language tasks.
Contribution
HADES reinterprets Mamba2 as an adaptive filter bank on a line graph, introducing hierarchical filters for global and local behaviors, bridging GSP and neural sequence modeling.
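A minimal sketch of the "filter on a line graph" reading, assuming the line graph is the directed path over token positions (this summary does not define it). Under that assumption the graph shift operator S delays a signal by one position, and a first-order recurrence with a fixed decay equals a polynomial in S applied to the input. The names `shift`, `recurrence`, and `polynomial_filter` are illustrative and not taken from the paper, and the fixed decay `a` is a simplification: in Mamba2 the decay varies per token.

```python
def shift(x):
    """Directed-path shift operator: (S x)[t] = x[t-1], with (S x)[0] = 0."""
    return [0.0] + list(x[:-1])


def recurrence(x, a, b):
    """Sequential form: h_t = a * h_{t-1} + b * x_t, starting from h = 0."""
    h, out = 0.0, []
    for xt in x:
        h = a * h + b * xt
        out.append(h)
    return out


def polynomial_filter(x, a, b):
    """Graph-filter form: b * sum_k a^k S^k x (S^k x vanishes once k >= len(x))."""
    out = [0.0] * len(x)
    shifted = list(x)
    coeff = b
    for _ in range(len(x)):
        out = [o + coeff * s for o, s in zip(out, shifted)]
        shifted = shift(shifted)  # move on to the next power of S
        coeff *= a                # matching filter tap b * a^k
    return out


x = [1.0, -2.0, 0.5, 3.0]
seq = recurrence(x, a=0.9, b=0.1)
poly = polynomial_filter(x, a=0.9, b=0.1)
print(all(abs(u - v) < 1e-12 for u, v in zip(seq, poly)))
```

Both forms compute the same output, which is what allows each SSM head to be read as one graph filter and a multi-head layer as a bank of them.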
Findings
HADES achieves comparable performance to Mamba2 on multiple benchmarks.
HADES uses only 58.9% of the parameters of baseline models.
HADES provides a structured, interpretable filtering approach for SSMs.
Abstract
State-space models (SSMs) offer efficient alternatives to attention with linear-time recurrence. Mamba2, a recent SSM-based language model, uses selective input gating and a multi-head structure, enabling parallel computation and strong benchmark performance. However, its multi-head recurrence operates independently without structured utilization or analysis. In this work, we propose a novel method called Hierarchical ADaptive filter bank for Efficient SSMs (HADES), a Graph Signal Processing (GSP)-inspired framework that reinterprets Mamba2 as an adaptive filter bank on a line graph. Our hierarchical architecture introduces two filter types: shared filters for global low-pass behavior and expert filters for local high-pass behavior, achieved through structured bias on the parameter $\Delta$. HADES achieves comparable performance to baseline models including Mamba2 across various…
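To make the role of a bias on $\Delta$ concrete, here is a rough sketch of how shifting $\Delta$ changes the frequency response of a single Mamba2-style scalar head. It assumes the recurrence h_t = a * h_{t-1} + Δ * x_t with a = exp(Δ * A) and A < 0, and a softplus on the biased pre-activation. The names `head_filter`, `delta_bias`, and `a_param` are hypothetical, and this is not the HADES implementation: how the paper structures the bias for shared versus expert filters is not spelled out in this summary.

```python
import math


def softplus(z: float) -> float:
    """Smooth positive map, used here to keep the step size delta > 0."""
    return math.log1p(math.exp(z))


def head_filter(delta_raw: float, delta_bias: float, a_param: float = -1.0):
    """Return (delta, decay, nyquist_to_dc) for one fixed-delta scalar head.

    delta_bias and a_param are illustrative assumptions, not paper values.
    """
    delta = softplus(delta_raw + delta_bias)
    decay = math.exp(delta * a_param)  # lies in (0, 1) because a_param < 0
    # Gain of h_t = decay * h_{t-1} + delta * x_t is |delta / (1 - decay * e^{-jw})|,
    # so the ratio of the gain at w = pi to the gain at w = 0 is (1 - decay) / (1 + decay).
    nyquist_to_dc = (1.0 - decay) / (1.0 + decay)
    return delta, decay, nyquist_to_dc


for bias in (-3.0, 0.0, 3.0):
    delta, decay, ratio = head_filter(delta_raw=0.0, delta_bias=bias)
    print(f"bias={bias:+.1f}  delta={delta:.3f}  decay={decay:.3f}  |H(pi)|/|H(0)|={ratio:.3f}")
```

In this toy setting a negative bias pushes the decay toward 1, which gives long memory and strong attenuation of high frequencies, while a positive bias shortens the memory and flattens the response so that more high-frequency, local content passes through.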
Peer Reviews
Decision: ICLR 2026 Poster

1. The paper introduces an interesting conceptual link between Graph Signal Processing and Mamba2, re-framing multi-head SSMs as filter banks. This GSP perspective motivates a novel routing mechanism based on a spectral residual, which is a creative approach to token-adaptive computation.
2. The authors conduct a thorough set of ablations on their proposed 370M-parameter model, which helps validate the components of their design, such as the auxiliary losses and the hierarchical filter structure

1. The paper's primary weakness is that all experiments are confined to a single, small 370M-parameter model. The central claim of 58.9% parameter savings is not validated at larger scales (e.g., 1B+), where model dynamics and efficiency trade-offs are known to change. This severely limits the generality and impact of the findings.
2. The paper's core premise of a GSP framework is not very strong. The authors admit that "spectral properties are not explicitly enforced" but rather "indirectly en

- The formalization of SSM heads as a Graph Signal Processing filter bank is clear and allows principled analysis of low-pass and adaptive behavior.
- The routing/bias mechanism tied to $\Delta_{HADES}$ gives a minimal hook for content-adaptive dynamics that aligns with the SSM parametrization.
- Competitive results in a much lower parameter-count regime, and performance improvements on long-context tasks.
- The ablation study and analysis are thorough.

- While the competitiveness of HADES at a reduced parameter count seems promising, the architecture does seem to cause more FLOP overhead. Including this analysis would strengthen the contributions of this work.
- The spectral analysis (FFT) on hidden sequences from one layer may be confounded with the layer or the gamma value.

Originality:
- Recasts multi-head Mamba2 as a graph filter bank on a line graph, connecting LTV SSMs to graph signal processing (GSP) and framing heads as node-variant graph filters. Introduces a novel architecture, HADES, a hierarchical filter bank with (i) always-on shared filters and (ii) token-routed expert filters, selected via a spectral residual and $\Delta$-modulation.
- Their construction of expert filters creates more opportunity for modular/interpretable filters, as they are trained t

While I enjoyed the paper's presentation and ideas overall, I think the major weakness of the paper is the strength of its empirical evidence. I will list this along two major axes:
1. **Lack of scale**: Despite the thorough experiments in ablation, sensitivity, and multiple baselines, the paper only trains 370M-parameter models on 200B tokens of the Pile. Because we only get one data point in the (number of tokens trained, model scale) plane, it is hard to know if the model will scale or hol
Taxonomy
Topics: Advanced Graph Neural Networks · Graph Theory and Algorithms · Topic Modeling
