Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding

Nidhi Bhatia; Ankit More; Ritika Borkar; Tiyasa Mitra; Ramon Matas; Ritchie Zhao; Maximilian Golub; Dheevatsa Mudigere; Brian Pharris; Bita Darvish Rouhani

arXiv:2507.07120·cs.DC·July 11, 2025

Open Access

TL;DR

Helix Parallelism offers a hybrid sharding strategy for large language models that significantly reduces latency and improves throughput for decoding with long token histories, enabling real-time inference.

Contribution

It introduces Helix Parallelism, a novel hybrid sharding approach combining KV parallelism and tensor parallelism, with a lightweight communication step to optimize multi-GPU decoding.
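To make the idea of KV parallelism concrete, here is a minimal sketch of how attention over a KV cache split along the sequence dimension can still produce the exact softmax result. This page does not spell out the communication step, so the merge shown here is the generic log-sum-exp rescaling technique, not necessarily the authors' implementation. The function names `partial_attention`, `merge_partials`, and `full_attention` are hypothetical, and the example covers a single query vector and a single head in pure Python.

```python
import math

def partial_attention(q, keys, values):
    """Attention over one KV shard (assumed layout: lists of per-token vectors).

    Returns the unnormalised weighted sum of values, the local max score,
    and the local sum of exponentials, which is what a merge step needs.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    local_max = max(scores)
    weights = [math.exp(s - local_max) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, local_max, denom

def merge_partials(partials):
    """Combine per-shard results into the exact full-softmax attention output."""
    global_max = max(m for _, m, _ in partials)
    dim = len(partials[0][0])
    merged = [0.0] * dim
    total = 0.0
    for out, local_max, denom in partials:
        # Rescale each shard from its local max to the shared global max.
        factor = math.exp(local_max - global_max)
        total += denom * factor
        for d in range(dim):
            merged[d] += out[d] * factor
    return [x / total for x in merged]

def full_attention(q, keys, values):
    """Reference: softmax attention over the unsharded KV cache."""
    out, _, denom = partial_attention(q, keys, values)
    return [x / denom for x in out]
```

In this sketch each shard only needs to send its partial output plus two scalars per query, which is one reason a merge of this kind can be cheap relative to reading a long KV cache.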

Findings

1. Reduces token-to-token latency by up to 1.5x.
2. Supports up to 32x larger batch sizes under the same latency.
3. Enables real-time inference with ultra-long sequences.

Abstract

As LLMs scale to multi-million-token KV histories, real-time autoregressive decoding under tight Token-to-Token Latency (TTL) constraints faces growing pressure. Two core bottlenecks dominate: accessing Feed-Forward Network (FFN) weights and reading long KV caches. While Tensor Parallelism (TP) helps mitigate the cost of FFN weight reads, it does not scale well for attention. When TP width exceeds the number of KV heads, it leads to inefficient KV duplication, limits parallelism, and constrains batch size. Simultaneously, DRAM reads for long KV histories scale linearly with batch size, further capping efficiency. We introduce Helix Parallelism, a hybrid execution strategy that applies KV parallelism during attention to shard KV caches across GPUs, then reuses the same GPUs for TP in dense LLMs or TP × Expert Parallelism (EP) in MoEs during FFN computation. To preserve exact attention…
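The KV duplication problem described in the abstract can be illustrated with simple arithmetic. The sketch below compares the per-GPU KV cache footprint for one layer when attention is sharded by KV head (as under TP) and when the cache is sharded along the sequence dimension. The helper names, the 2-byte element size, and the example model shape (8 KV heads, head dimension 128) are illustrative assumptions, not figures from the paper.

```python
import math

def kv_bytes_per_gpu_tp(seq_len, batch, kv_heads, head_dim, tp_width, bytes_per_elem=2):
    """Per-GPU KV cache bytes for one layer when sharding by KV head (hypothetical helper)."""
    # A GPU cannot hold less than one whole KV head, so once tp_width exceeds
    # kv_heads the extra GPUs hold duplicate heads and the footprint stops shrinking.
    heads_per_gpu = max(1, math.ceil(kv_heads / tp_width))
    return 2 * batch * seq_len * heads_per_gpu * head_dim * bytes_per_elem  # 2 = K and V

def kv_bytes_per_gpu_seq_sharded(seq_len, batch, kv_heads, head_dim, num_gpus, bytes_per_elem=2):
    """Per-GPU KV cache bytes for one layer when sharding along the sequence (hypothetical helper)."""
    tokens_per_gpu = math.ceil(seq_len / num_gpus)
    return 2 * batch * tokens_per_gpu * kv_heads * head_dim * bytes_per_elem

if __name__ == "__main__":
    # Assumed shape: 1M-token history, batch 1, 8 KV heads, head_dim 128, 16-bit elements.
    for gpus in (8, 16, 32):
        tp = kv_bytes_per_gpu_tp(1_000_000, 1, 8, 128, gpus)
        sq = kv_bytes_per_gpu_seq_sharded(1_000_000, 1, 8, 128, gpus)
        print(f"{gpus} GPUs: head-sharded {tp / 1e6:.0f} MB, sequence-sharded {sq / 1e6:.0f} MB")
```

Under these assumptions the head-sharded footprint is flat beyond 8 GPUs, while the sequence-sharded footprint keeps shrinking as GPUs are added, which is the scaling gap the abstract points to.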

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Tensor decomposition and applications · Advanced Neural Network Applications