Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles

Moiz Arif; Avinash Maurya; Sudharshan Vazhkudai; Bogdan Nicolae

arXiv:2605.19775·cs.DC·May 20, 2026

TL;DR

This paper analyzes the system bottlenecks and trade-offs in scaling inference for reasoning-centric large language models, providing a framework for optimizing infrastructure at different model scales.

Contribution

It offers a comprehensive characterization of inference bottlenecks across model sizes and proposes strategies for efficient scaling in reasoning workloads.

Findings

01

Data parallelism hits capacity limits on reasoning workloads due to KV-cache fragmentation.

02

Tensor parallelism provides sublinear gains and helps unlock stranded memory at large scales.

03

Dense models are memory and interconnect bound, while sparse MoE models are limited by routing and synchronization.

Abstract

The transition from standard generative AI to reasoning-centric architectures, exemplified by models capable of extensive Chain-of-Thought (CoT) processing, marks a fundamental paradigm shift in system requirements. Unlike traditional workloads dominated by compute-bound prefill, reasoning workloads generate long chains of reasoning tokens that shift inference into a Capacity-Bound regime. This paper presents a comprehensive system characterization, evaluating models ranging from 8B to 671B parameters on GPU clusters. By systematically exploring the interplay between Data, Tensor, and Pipeline parallelism, we identify critical bottlenecks that defy standard scaling heuristics. Our analysis reveals that data parallelism is throughput-efficient for small models but hits a capacity trap on reasoning workloads as KV-cache fragmentation forces early throttling, resulting in…
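To make the Capacity-Bound regime concrete, here is a minimal back-of-the-envelope sketch of how KV-cache size limits concurrency as reasoning chains grow. It uses the standard per-token KV-cache estimate (keys plus values, per layer, per KV head). The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit elements) and the 40 GiB of free memory are hypothetical values chosen for illustration, not figures from the paper, and the sketch ignores fragmentation, which the abstract identifies as making the limit tighter in practice.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV-cache bytes for one token of one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(free_bytes, tokens_per_seq, bytes_per_token):
    """Upper bound on sequences that fit, assuming no fragmentation."""
    return free_bytes // (tokens_per_seq * bytes_per_token)


# Hypothetical 8B-class model shape (an assumption, not taken from the paper).
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Hypothetical memory left for the KV cache after loading weights.
free = 40 * 2**30

short_chat = max_concurrent_sequences(free, tokens_per_seq=1024, bytes_per_token=per_token)
long_reasoning = max_concurrent_sequences(free, tokens_per_seq=32768, bytes_per_token=per_token)

print(f"KV cache per token: {per_token // 1024} KiB")
print(f"Concurrent 1K-token sequences:  {short_chat}")
print(f"Concurrent 32K-token sequences: {long_reasoning}")
```

Under these assumptions, a 32x longer sequence cuts the number of sequences a replica can hold by the same factor of 32, which is the kind of pressure that can throttle a data-parallel replica well before its compute is saturated.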
