Scaling State-Space Models on Multiple GPUs with Tensor Parallelism
Anurag Dutt, Nimit Shah, Hazem Masarani, Anshul Gandhi

TL;DR
This paper introduces a communication-efficient tensor parallelism (TP) method for scaling selective state space model (SSM) inference across multiple GPUs, reporting throughput gains of 1.6-2.1x on 2 GPUs and 2.6-4.0x on 4 GPUs, especially for long-context workloads.
Contribution
The paper proposes a novel tensor parallelism design tailored for selective SSMs, addressing key engineering challenges to enable efficient multi-GPU inference.
Findings
Achieves 1.6-2.1x throughput gain on 2 GPUs
Achieves 2.6-4.0x throughput gain on 4 GPUs
Further improves throughput by 10-18% using quantized all-reduce
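As a quick sanity check on these numbers, the reported speedups can be converted to parallel efficiency (speedup divided by GPU count). This is a minimal sketch: the helper name `parallel_efficiency` and the dictionary layout are illustrative, and it assumes the gains are measured relative to a single GPU, which this summary does not state explicitly.

```python
def parallel_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return speedup / num_gpus


# Speedup ranges as listed in the findings (low end, high end).
reported = {2: (1.6, 2.1), 4: (2.6, 4.0)}

# Assumption: each speedup is relative to a single-GPU baseline.
efficiency = {
    gpus: (parallel_efficiency(low, gpus), parallel_efficiency(high, gpus))
    for gpus, (low, high) in reported.items()
}

for gpus, (low, high) in efficiency.items():
    print(f"{gpus} GPUs: {low:.0%} to {high:.0%} of linear scaling")
```

Under that assumption, the ranges correspond to roughly 80-105% of linear scaling on 2 GPUs and 65-100% on 4 GPUs.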
Abstract
Selective state space models (SSMs) have rapidly become a compelling backbone for large language models, especially for long-context workloads. Yet in deployment, their inference performance is often bounded by the memory capacity, bandwidth, and latency limits of a single GPU, making multi-GPU execution increasingly necessary. Although tensor parallelism (TP) is widely used to scale Transformer inference, applying it to selective SSM blocks is non-trivial because the SSM mixer couples large projections with a sequence-wise recurrent state update and local mixing whose efficiency depends on preserving locality and avoiding synchronization in the critical path. This paper presents a communication-efficient TP design for selective SSM inference that addresses three practical engineering challenges: enabling TTFT improvements via an SSM state cache across prefill and decode, partitioning…
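For readers unfamiliar with tensor parallelism, the sketch below simulates the generic column-parallel then row-parallel projection pattern on one machine with NumPy, followed by a toy int8 version of the final reduction. It is not the paper's design: the layer sizes, the SiLU stand-in for per-channel work, the per-tensor quantization scheme, and all function names are assumptions chosen for illustration, and the "ranks" are just list entries rather than real GPUs.

```python
import numpy as np

rng = np.random.default_rng(0)
num_ranks = 4             # simulated GPUs (assumption for illustration)
d_model, d_inner = 8, 16  # hypothetical layer sizes

x = rng.standard_normal((3, d_model))            # 3 tokens, replicated on every rank
w_in = rng.standard_normal((d_model, d_inner))   # input projection
w_out = rng.standard_normal((d_inner, d_model))  # output projection

# Column-parallel split of w_in and row-parallel split of w_out:
# each rank owns a contiguous slice of the inner channels.
w_in_shards = np.split(w_in, num_ranks, axis=1)
w_out_shards = np.split(w_out, num_ranks, axis=0)


def silu(v):
    """Elementwise nonlinearity standing in for per-channel work on each rank."""
    return v / (1.0 + np.exp(-v))


# Each rank computes a partial output from its own channel slice only.
# No communication is needed until the partial outputs are summed.
partials = [silu(x @ wi) @ wo for wi, wo in zip(w_in_shards, w_out_shards)]


def all_reduce_sum(tensors):
    """Stand-in for a collective all-reduce: every rank would receive this sum."""
    return np.sum(tensors, axis=0)


def quantized_all_reduce_sum(tensors):
    """Toy int8 variant: quantize each partial with a per-tensor scale, then sum."""
    total = np.zeros_like(tensors[0])
    for t in tensors:
        scale = np.max(np.abs(t)) / 127.0  # assumes t is not all zeros
        q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
        total += q.astype(np.float64) * scale  # dequantize before accumulating
    return total


reference = silu(x @ w_in) @ w_out          # unsharded computation
tp_output = all_reduce_sum(partials)        # exact reduction
tp_output_q = quantized_all_reduce_sum(partials)

max_err_exact = np.max(np.abs(tp_output - reference))
max_err_quant = np.max(np.abs(tp_output_q - reference))
print(f"exact all-reduce error: {max_err_exact:.2e}")
print(f"int8 all-reduce error:  {max_err_quant:.2e}")
```

The exact reduction reproduces the unsharded result up to floating-point rounding, while the int8 variant trades a small, bounded error for sending fewer bytes per element. A real deployment would replace the list of partials with a collective operation across devices.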
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Tensor decomposition and applications · Error Correcting Code Techniques
