Swimba: Switch Mamba Model Scales State Space Models
Zhixu Du, Krishna Teja Chitty-Venkata, Murali Emani, Venkatram Vishwanath, Hai Helen Li, Yiran Chen

TL;DR
Swimba is a mixture-of-experts design for state space models that adds capacity without multiplying the cost of the recurrence, slightly outperforming baselines on benchmarks at similar latency and throughput.
Contribution
The paper proposes Switch Mamba (Swimba), a new MoE design for SSMs that maintains computational efficiency while increasing model capacity through expert routing in parameter space.
Findings
Swimba slightly outperforms baseline models on benchmarks.
Swimba maintains similar latency and throughput under matched FLOPs.
Theoretical analysis confirms stability of MoE-parameterized SSMs.
Abstract
Mixture-of-experts (MoE) is a common approach for increasing parameter capacity, but applying MoE to state space model (SSM) token mixers can multiply the cost of the recurrent state update. We study how to introduce expert specialization into selective SSMs while preserving computational efficiency. We show that MoE-SSM can refer to two designs: (1) MoE over separated SSMs, which maintains multiple state trajectories and thus scales compute with the number of experts; and (2) MoE-parameterized SSM, which mixes experts in parameter space, maintains a single state trajectory, and evaluates the recurrence once. Our method, Switch Mamba (Swimba), follows the second design by routing over expert-produced SSM streams. Theoretically, we establish well-definedness and stability for MoE-parameterized SSMs and characterize the relationship between the two designs. Empirically, we evaluate…
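The difference between the two designs named in the abstract can be illustrated with a toy diagonal recurrence. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names (`scan`, `moe_separated`, `moe_parameterized`), the dense softmax gate, the diagonal state update, and all shapes are hypothetical choices made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scan(a, b, c, x):
    """Toy diagonal recurrence (assumed form, not taken from the paper):
    h_t = a_t * h_{t-1} + b_t * x_t,  y_t = <c_t, h_t>.
    a, b, c have shape (T, N); x has shape (T,)."""
    h = np.zeros(a.shape[1])
    ys = np.empty(len(x))
    for t in range(len(x)):
        h = a[t] * h + b[t] * x[t]
        ys[t] = c[t] @ h
    return ys

def moe_separated(A, B, C, gates, x):
    """Design 1: each expert keeps its own state trajectory and the
    outputs are mixed, so the recurrence runs E times."""
    E = A.shape[0]
    ys = np.stack([scan(A[e], B[e], C[e], x) for e in range(E)])  # (E, T)
    return (gates.T * ys).sum(axis=0)

def moe_parameterized(A, B, C, gates, x):
    """Design 2: expert parameters are mixed per token first, then a
    single recurrence runs over one shared state trajectory."""
    a = np.einsum("te,etn->tn", gates, A)
    b = np.einsum("te,etn->tn", gates, B)
    c = np.einsum("te,etn->tn", gates, C)
    return scan(a, b, c, x)

# Hypothetical sizes: E experts, T tokens, N state dimensions.
rng = np.random.default_rng(0)
E, T, N = 4, 16, 8
A = rng.uniform(0.1, 0.9, size=(E, T, N))   # per-expert decay factors in (0, 1)
B = rng.normal(size=(E, T, N))
C = rng.normal(size=(E, T, N))
x = rng.normal(size=T)
gates = softmax(rng.normal(size=(T, E)))     # dense convex weights per token

y_sep = moe_separated(A, B, C, gates, x)     # E recurrences
y_par = moe_parameterized(A, B, C, gates, x) # 1 recurrence

# A convex mix of decay factors in (0, 1) stays in (0, 1).
mixed_decay = np.einsum("te,etn->tn", gates, A)

# With constant one-hot routing, both designs reduce to one expert's SSM.
one_hot = np.zeros((T, E))
one_hot[:, 0] = 1.0
y_sep_hard = moe_separated(A, B, C, one_hot, x)
y_par_hard = moe_parameterized(A, B, C, one_hot, x)
```

In this sketch the two designs agree when every token is routed to the same single expert, and generally differ otherwise, because mixing parameters before the recurrence is not the same as mixing outputs after it. The cost difference is visible in the number of `scan` calls: `E` for the separated design, one for the parameterized design.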
Taxonomy
Topics: Data Stream Mining Techniques · Software System Performance and Reliability · Time Series Analysis and Forecasting
