# Learning to Shard: RL for Co-optimizing the Parallelism Degrees and Per-operator Sharding Dimensions in Distributed LLM Inference

**Authors:** Ruokai Yin, Sattwik Deb Mishra, Xuan Zuo, Hokchhay Tann, Preyas Shah, Apala Guha

arXiv: 2509.00217 · 2025-09-03

## TL;DR

Learn to Shard is an RL-based method that co-optimizes parallelism degrees and per-operator sharding dimensions for distributed LLM inference. It reports up to 3.5x higher throughput than metaheuristic baselines and 1.06x higher than Megatron heuristics.

## Contribution

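The paper introduces what the authors describe as, to their knowledge, the first RL approach to co-optimize coarse-grained parallelism degrees and fine-grained per-operator sharding dimensions for distributed LLM inference. It targets a limitation of static heuristics, which configure the two separately.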

## Key findings

- Up to 3.5x throughput improvement over metaheuristic baselines
- 1.06x throughput improvement over Megatron heuristics
- Evaluated on H100 clusters with MoE models up to 1.6T parameters

## Abstract

Distributed LLM inference requires careful coordination of parallelization strategies across hundreds to thousands of NPUs to meet production SLOs. Current systems like Megatron-LM rely on static heuristics that separately configure parallelism degrees and per-operator sharding dimensions, leaving significant performance on the table as models scale and hardware topologies diversify. We introduce Learn to Shard, to our knowledge, the first RL-based approach to co-optimize both coarse-grained parallelism degrees and fine-grained per-operator sharding dimensions for distributed LLM inference. Our method employs an attention-based policy over an elite history that learns from high-performing strategies to efficiently navigate the vast combinatorial search space. Evaluated on H100 clusters with MoE models up to 1.6T parameters, Learn to Shard achieves up to 3.5x throughput improvement over metaheuristic baselines and 1.06x over Megatron heuristics.
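
To give a feel for why the abstract calls the search space combinatorial, the following minimal sketch counts joint configurations for a toy setup. It is an illustration under stated assumptions, not the paper's formulation: the axis names (`tp`, `pp`, `ep`), the rule that the degrees must multiply to the device count, and the operator and dimension names in `SHARDING_CHOICES` are all hypothetical.

```python
from itertools import product
from math import prod


def parallelism_degrees(num_devices, axes=("tp", "pp", "ep")):
    """Yield every assignment of degrees to `axes` whose product is num_devices.

    Assumption: each axis degree divides the device count and the degrees
    multiply to exactly num_devices (a simplification for illustration).
    """
    divisors = [d for d in range(1, num_devices + 1) if num_devices % d == 0]
    for combo in product(divisors, repeat=len(axes)):
        if prod(combo) == num_devices:
            yield dict(zip(axes, combo))


def joint_space_size(num_devices, sharding_choices):
    """Count joint configurations: degree assignments x per-operator choices."""
    n_degrees = sum(1 for _ in parallelism_degrees(num_devices))
    n_sharding = prod(len(dims) for dims in sharding_choices.values())
    return n_degrees * n_sharding


# Hypothetical operators and candidate sharding dimensions (not from the paper).
SHARDING_CHOICES = {
    "attention_qkv": ("heads", "hidden"),
    "attention_out": ("heads", "hidden"),
    "moe_expert_up": ("expert", "ffn", "hidden"),
    "moe_expert_down": ("expert", "ffn", "hidden"),
}

if __name__ == "__main__":
    for n in (8, 64, 512):
        print(n, joint_space_size(n, SHARDING_CHOICES))
```

Even in this toy setup the count is the product of the coarse-grained and fine-grained choices, so adding operators or devices grows the space multiplicatively. This is the kind of space that a learned policy, rather than exhaustive enumeration, is meant to navigate.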

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00217/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00217/full.md

## References

20 references — full list in the complete paper: https://tomesphere.com/paper/2509.00217/full.md

---
Source: https://tomesphere.com/paper/2509.00217