Let SSMs be ConvNets: State-space Modeling with Optimal Tensor Contractions
Yan Ru Pei

TL;DR
The paper introduces Centaurus, a flexible SSM-based network architecture whose training efficiency is optimized by choosing the order of its tensor contractions. It achieves state-of-the-art audio processing performance without traditional recurrence or attention mechanisms.
Contribution
It presents a novel network design that treats SSM operations as tensor contractions, optimizing their order for efficiency, and demonstrates superior audio processing results.
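To make the idea of choosing a contraction order concrete, here is a minimal NumPy sketch. It is not the paper's implementation. The tensor names, shapes, and the einsum expression are illustrative assumptions: an input `u`, an input projection `B`, per-state basis kernels `K`, and an output projection `C` are contracted into an output. `np.einsum_path` searches for a cheap pairwise order, and the result is checked against a single unoptimized contraction.

```python
import numpy as np

# Hypothetical sizes: batch, input channels, output channels, state size, length.
batch, c_in, c_out, n_state, length = 4, 8, 8, 16, 64
rng = np.random.default_rng(0)

u = rng.standard_normal((batch, c_in, length))  # input signal (assumed layout)
B = rng.standard_normal((c_in, n_state))        # input projection (assumed)
K = rng.standard_normal((n_state, length))      # per-state basis kernels (assumed)
C = rng.standard_normal((n_state, c_out))       # output projection (assumed)

# One illustrative contraction: sum over input channels i and states n,
# keeping batch b, output channel o, and position l.
expr = "bil,in,nl,no->bol"

# Ask NumPy for the cheapest pairwise contraction order it can find.
path, report = np.einsum_path(expr, u, B, K, C, optimize="optimal")
print(report)  # human-readable summary of the chosen order and estimated cost

# Evaluate with the chosen order, and with no reordering as a reference.
y_opt = np.einsum(expr, u, B, K, C, optimize=path)
y_naive = np.einsum(expr, u, B, K, C, optimize=False)
```

Both evaluations give the same tensor; only the intermediate sizes and operation counts differ, which is the quantity the contraction order controls.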
Findings
Outperforms homogeneous SSM networks in audio tasks
First fully state-space based ASR network without recurrence or attention
Achieves competitive performance with flexible, efficient design
Abstract
We introduce Centaurus, a class of networks composed of generalized state-space model (SSM) blocks, where the SSM operations can be treated as tensor contractions during training. The optimal order of tensor contractions can then be systematically determined for every SSM block to maximize training efficiency. This allows more flexibility in designing SSM blocks beyond the depthwise-separable configuration commonly implemented. The new design choices will take inspiration from classical convolutional blocks including group convolutions, full convolutions, and bottleneck blocks. We architect the Centaurus network with a mixture of these blocks, to balance between network size and performance, as well as memory and computational efficiency during both training and inference. We show that this heterogeneous network design outperforms its homogeneous counterparts in raw audio processing…
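The link between SSMs and convolutions that the title alludes to can be illustrated with a standard identity: a linear time-invariant SSM with a diagonal state matrix produces the same output as a causal convolution with the kernel k[l] = Σ_n c_n a_n^l b_n. The sketch below assumes a single-channel, discrete-time, real-valued diagonal SSM with made-up parameters. It is not the Centaurus block itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_state, length = 4, 32

# Hypothetical diagonal SSM: x[t] = a * x[t-1] + b * u[t],  y[t] = c . x[t]
a = rng.uniform(0.5, 0.95, n_state)  # stable diagonal state transition
b = rng.standard_normal(n_state)     # input projection
c = rng.standard_normal(n_state)     # output projection
u = rng.standard_normal(length)      # input sequence

# Recurrent evaluation, one step at a time.
x = np.zeros(n_state)
y_rec = np.empty(length)
for t in range(length):
    x = a * x + b * u[t]
    y_rec[t] = c @ x

# Convolutional evaluation: build the kernel k[l] = sum_n c_n * a_n**l * b_n
# as a small tensor contraction over the state index n.
powers = a[None, :] ** np.arange(length)[:, None]  # shape (length, n_state)
kernel = np.einsum("ln,n,n->l", powers, b, c)

# Causal convolution via zero-padded FFT, truncated to the input length.
n_fft = 2 * length
spectrum = np.fft.rfft(u, n_fft) * np.fft.rfft(kernel, n_fft)
y_conv = np.fft.irfft(spectrum, n_fft)[:length]
```

The two outputs agree to numerical precision, so the same block can be evaluated as a convolution or as a step-by-step recurrence.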
Peer Reviews
Decision: ICLR 2025 Spotlight
1. The concept of generalized SSM blocks, configurable with flexible connectivity structures, demonstrates good soundness and could complement existing SSM designs.
2. The proposed method is supported by both theoretical analysis and empirical experiments.
1. The major concern is that the real-device training/inference efficiency of the proposed generalized SSM blocks is not provided. It is unclear whether the flexibility of these building blocks comes at the cost of reduced real-device efficiency.
2. Another major concern is that the position of the proposed method among the latest SOTA SSMs is unclear. For example, Mamba has introduced input adaptivity in their selective state design and employed an advanced macro-structure with gating mechanis…
- The presentation of methodology is convincing and strongly tied to first principles in SSMs.
- The application of tensor networks to solve the problem of projecting kernel matrices to and from frequency spaces of different dimensions is novel and elegant.
- The presentation of implementation and computational considerations shows care was given to scaling tradeoffs, i.e. memory-boundedness of this compute regime, opportunities (or lack thereof) for operator fusion, and operation/cont…
- Baselines could be much stronger, in that there are no other SSM model baselines. Ablations are mostly within the architectural innovations present with Centaurus.
- The constraint that some projection matrices from basis kernels must be real limits the expressiveness of the model, although it’s likely that generalizations in future work could lift this constraint.
- A more comprehensive architectural ablation would make the paper stronger. While there are explanations of architectures in app…
Their hybrid architecture shows better results at lower FLOPs, while retaining a similar scaling behaviour compared to other homogeneous models across model sizes.
The proposed model is a combination of known primitives (SSMs and inhomogeneous scaling from CNN architectures). The work does not cite and compare to other relevant, related work that potentially outperforms the trained models on the given datasets, e.g. Zhao et al. 2022: "Monaural speech enhancement with complex convolutional block attention module and joint time frequency losses." for the VB-DMD dataset.
Taxonomy
Topics: Tensor decomposition and applications
Methods: Softmax · Attention Is All You Need
