Scaling State-Space Models on Multiple GPUs with Tensor Parallelism

Anurag Dutt; Nimit Shah; Hazem Masarani; Anshul Gandhi

arXiv:2602.21144 · cs.DC · February 25, 2026

TL;DR

This paper introduces a communication-efficient tensor parallelism method for scaling selective state space models across multiple GPUs, improving inference throughput, especially for long-context workloads.

Contribution

The paper proposes a tensor parallelism design tailored to selective SSMs, addressing three practical engineering challenges to enable efficient multi-GPU inference.

Findings

01. Achieves a 1.6-2.1x throughput gain on 2 GPUs
02. Achieves a 2.6-4.0x throughput gain on 4 GPUs
03. Further improves throughput by 10-18% using quantized all-reduce
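
To make the third finding concrete, here is a minimal sketch of what a quantized all-reduce does, simulated on one machine in plain Python. It assumes symmetric per-tensor int8 quantization, which is a generic scheme chosen for illustration; this summary does not say which quantization format the paper uses. The function names (quantize_int8, quantized_all_reduce, and so on) are hypothetical, and the "ranks" are ordinary lists rather than real GPUs.

```python
from typing import List, Tuple


def quantize_int8(x: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor int8 quantization (assumed scheme): returns (codes, scale)."""
    max_abs = max((abs(v) for v in x), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in x]
    return codes, scale


def dequantize_int8(codes: List[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


def exact_all_reduce(per_rank: List[List[float]]) -> List[float]:
    """Reference sum all-reduce in full precision."""
    return [sum(vals) for vals in zip(*per_rank)]


def quantized_all_reduce(per_rank: List[List[float]]) -> List[float]:
    """Simulated sum all-reduce where each rank sends int8 codes plus one float scale."""
    payloads = [quantize_int8(x) for x in per_rank]  # what would go over the wire
    total = [0.0] * len(per_rank[0])
    for codes, scale in payloads:
        for i, v in enumerate(dequantize_int8(codes, scale)):
            total[i] += v  # accumulate in higher precision after dequantizing
    return total


if __name__ == "__main__":
    ranks = [[0.5, -1.0, 0.25], [1.5, 0.75, -0.5]]
    print("exact:    ", exact_all_reduce(ranks))
    print("quantized:", quantized_all_reduce(ranks))
```

The trade-off in this sketch is that each rank sends one byte per element instead of a wider float, at the cost of a rounding error of at most half a quantization step per rank per element.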

Abstract

Selective state space models (SSMs) have rapidly become a compelling backbone for large language models, especially for long-context workloads. Yet in deployment, their inference performance is often bounded by the memory capacity, bandwidth, and latency limits of a single GPU, making multi-GPU execution increasingly necessary. Although tensor parallelism (TP) is widely used to scale Transformer inference, applying it to selective SSM blocks is non-trivial because the SSM mixer couples large projections with a sequence-wise recurrent state update and local mixing whose efficiency depends on preserving locality and avoiding synchronization in the critical path. This paper presents a communication-efficient TP design for selective SSM inference that addresses three practical engineering challenges: enabling TTFT improvements via an SSM state cache across prefill and decode, partitioning…
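
The locality argument in the abstract can be illustrated with a toy mixer, sketched below under strong simplifying assumptions. The input projection is split by columns and the output projection by rows, so each simulated rank owns a disjoint slice of channels, runs its recurrence with no communication, and contributes a partial output that is combined by a single sum all-reduce. The recurrence here is a fixed-decay, per-channel update standing in for the selective scan; it is not the paper's kernel, and all names (mixer, mixer_tp, scan) are hypothetical. The state cache and the other challenges named in the abstract are not modeled.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def scan(u: Matrix, decay: List[float]) -> Matrix:
    """Per-channel recurrence h_t[c] = decay[c] * h_{t-1}[c] + u_t[c] (stand-in for the SSM scan)."""
    h = [0.0] * len(decay)
    out = []
    for u_t in u:
        h = [d * h_c + u_c for d, h_c, u_c in zip(decay, h, u_t)]
        out.append(h)
    return out


def mixer(x: Matrix, w_in: Matrix, decay: List[float], w_out: Matrix) -> Matrix:
    """Unsharded reference: project in, run the recurrence, project out."""
    return matmul(scan(matmul(x, w_in), decay), w_out)


def mixer_tp(x: Matrix, w_in: Matrix, decay: List[float], w_out: Matrix,
             world_size: int) -> Matrix:
    """Simulated tensor-parallel mixer: channels are sharded, one all-reduce at the end."""
    channels = len(decay)
    assert channels % world_size == 0, "channels must divide evenly across ranks"
    shard = channels // world_size
    partials = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        w_in_r = [row[lo:hi] for row in w_in]  # column shard of the input projection
        w_out_r = w_out[lo:hi]                 # matching row shard of the output projection
        # The recurrence touches only this rank's channels, so it stays local.
        partials.append(mixer(x, w_in_r, decay[lo:hi], w_out_r))
    # Sum all-reduce over the per-rank partial outputs.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


if __name__ == "__main__":
    x = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]
    w_in = [[1.0, 0.5, -1.0, 2.0], [0.0, 1.0, 0.5, -0.5]]
    decay = [0.5, 0.25, 0.75, 0.5]
    w_out = [[1.0, 0.0], [0.5, 1.0], [-1.0, 2.0], [0.0, 0.5]]
    print(mixer(x, w_in, decay, w_out))
    print(mixer_tp(x, w_in, decay, w_out, world_size=2))
```

Because the toy recurrence is diagonal in the channel dimension, sharding by channel reproduces the unsharded result up to floating-point summation order, and the only synchronization point is the final reduction.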

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Tensor decomposition and applications · Error Correcting Code Techniques