Mambalaya: Einsum-Based Fusion Optimizations on State-Space Models
Toluwanimi O. Odemuyiwa, John D. Owens, Joel S. Emer, Michael Pellauer

TL;DR
Mambalaya is a reconfigurable accelerator that uses the extended Einsum framework to explore fusion across Mamba's full operator cascade, reducing off-chip memory traffic and improving performance.
Contribution
The paper introduces a novel reconfigurable accelerator architecture and a systematic method for exploring fusion in Mamba workloads, based on the cascade-of-Einsums abstraction.
Findings
Achieves 4.9× speedup in prefill and 1.9× in generation over MARCA.
Reduces off-chip inter-Einsum traffic through optimized fusion mappings.
Outperforms recent memory-aware fusion accelerators by up to 1.5× in prefill scenarios.
Abstract
Mamba is an emerging, complex workload with various short-range and long-range dependencies, nonlinearities, and elementwise computations that are unable to run at near-peak speeds on modern hardware. Specifically, Mamba's complex dependency graph makes fusion across its full operator cascade difficult, leaving substantial inter-operator memory traffic on the table. To address these challenges, we propose Mambalaya, a novel reconfigurable accelerator that leverages fusion to overcome these bottlenecks. We use the recently proposed cascade-of-Einsums abstraction to characterize Mamba's full computational structure, then apply the extended Einsum framework to systematically explore inter-Einsum fusion opportunities. This principled approach yields a series of fusion mappings that reduce off-chip inter-Einsum traffic. These mappings are supported by the underlying Mambalaya…
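The abstract does not spell out Mamba's Einsums, so the following is only a minimal sketch, assuming the simplified selective-scan formulation from the original Mamba paper (a full Mamba block also has projections, a short convolution, and gating around it). It writes the scan as a cascade of four Einsum-style stages and contrasts an unfused schedule, which materializes every sequence-length intermediate, with a schedule fused along the sequence dimension that keeps only per-step (D, N) tiles live. The function names, shapes, and the toy traffic model (any intermediate indexed by the sequence dimension `l` is counted as spilled off chip) are hypothetical and are not Mambalaya's actual mappings.

```python
import numpy as np


def selective_scan_unfused(x, delta, A, B, C):
    """Simplified selective scan as a cascade of Einsum-style stages (hypothetical sketch).

    Shapes: x, delta: (L, D); A: (D, N); B, C: (L, N).
    Every intermediate indexed by the sequence dimension l is materialized,
    which this toy model counts as off-chip inter-Einsum traffic.
    """
    L, D = x.shape
    N = A.shape[1]
    # Stage 1: discretized state transition, dA[l,d,n] = exp(delta[l,d] * A[d,n])
    dA = np.exp(np.einsum("ld,dn->ldn", delta, A))
    # Stage 2: discretized input, dBx[l,d,n] = delta[l,d] * B[l,n] * x[l,d]
    dBx = np.einsum("ld,ln,ld->ldn", delta, B, x)
    # Stage 3: recurrence over l (the long-range dependency); all states are kept
    H = np.empty((L, D, N))
    h = np.zeros((D, N))
    for l in range(L):
        h = dA[l] * h + dBx[l]
        H[l] = h
    # Stage 4: output contraction, y[l,d] = sum_n C[l,n] * H[l,d,n]
    y = np.einsum("ln,ldn->ld", C, H)
    spilled = dA.size + dBx.size + H.size  # elements of the L x D x N intermediates
    return y, spilled


def selective_scan_fused(x, delta, A, B, C):
    """Same cascade fused along l: only (D, N) tiles and the state h stay live."""
    L, D = x.shape
    N = A.shape[1]
    y = np.empty((L, D))
    h = np.zeros((D, N))
    for l in range(L):
        dA_l = np.exp(delta[l][:, None] * A)                # stage 1, one step
        dBx_l = (delta[l] * x[l])[:, None] * B[l][None, :]  # stage 2, one step
        h = dA_l * h + dBx_l                                # stage 3, one step
        y[l] = h @ C[l]                                     # stage 4, one step
    spilled = 0  # no L x D x N intermediate is ever written out
    return y, spilled


# Usage with small random inputs (A < 0 and delta > 0 keep the recurrence stable).
rng = np.random.default_rng(0)
L, D, N = 16, 8, 4
x = rng.standard_normal((L, D))
delta = rng.uniform(0.01, 0.1, (L, D))
A = -rng.uniform(0.5, 1.0, (D, N))
B = rng.standard_normal((L, N))
C = rng.standard_normal((L, N))

y_unfused, spilled_unfused = selective_scan_unfused(x, delta, A, B, C)
y_fused, spilled_fused = selective_scan_fused(x, delta, A, B, C)
print(np.allclose(y_unfused, y_fused), spilled_unfused, spilled_fused)  # prints: True 1536 0
```

In this toy model, fusing along the sequence dimension produces the same output while eliminating all 3·L·D·N spilled intermediate elements. How the paper's mappings handle the scan's dependencies and map these stages onto reconfigurable hardware is not captured here.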
