Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles
Moiz Arif, Avinash Maurya, Sudharshan Vazhkudai, Bogdan Nicolae

TL;DR
This paper analyzes the system bottlenecks and trade-offs in scaling inference for reasoning-centric large language models, providing a framework for optimizing infrastructure at different model scales.
Contribution
It offers a comprehensive characterization of inference bottlenecks across model sizes and proposes strategies for efficient scaling in reasoning workloads.
Findings
Data parallelism hits capacity limits on reasoning workloads due to KV-cache fragmentation.
Tensor parallelism provides sublinear gains and helps unlock stranded memory at large scales.
Dense models are memory and interconnect bound, while sparse MoE models are limited by routing and synchronization.
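To make the capacity limit in the first finding concrete, here is a rough back-of-the-envelope sketch of how KV-cache size grows with the length of a reasoning trace. The per-token formula (keys plus values, per layer, per KV head) is the standard one for attention with grouped KV heads. The model shape and the 40 GiB memory budget are illustrative assumptions, not figures from the paper. The estimate also ignores fragmentation, so it is an upper bound on how many sequences fit.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    """Attention shape needed to size the KV cache (hypothetical values below)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # fp16 / bf16


# Assumed shape, loosely in the range of an 8B-class dense model.
# These numbers are NOT taken from the paper.
HYPOTHETICAL_8B = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128)


def kv_bytes_per_token(m: ModelShape) -> int:
    # Factor 2 covers one key and one value vector per layer and KV head.
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.bytes_per_elem


def kv_bytes_per_sequence(m: ModelShape, seq_len: int) -> int:
    # The cache grows linearly with prompt plus generated tokens.
    return kv_bytes_per_token(m) * seq_len


def max_concurrent_sequences(m: ModelShape, seq_len: int, free_bytes: int) -> int:
    # Upper bound: assumes perfectly packed memory, i.e. no fragmentation.
    return free_bytes // kv_bytes_per_sequence(m, seq_len)
```

Under these assumptions each token costs 128 KiB of cache, so a 32,768-token reasoning trace occupies 4 GiB. A replica with 40 GiB free for KV cache could then hold at most 10 such sequences, compared with 320 sequences of 1,024 tokens. Each data-parallel replica has to absorb that footprint on its own, which is one way to read the capacity limit described above.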
Abstract
The transition from standard generative AI to reasoning-centric architectures, exemplified by models capable of extensive Chain-of-Thought (CoT) processing, marks a fundamental paradigm shift in system requirements. Unlike traditional workloads dominated by compute-bound prefill, reasoning workloads generate long chains of reasoning tokens that shift inference into a Capacity-Bound regime. This paper presents a comprehensive system characterization, evaluating models ranging from 8B to 671B parameters on GPU clusters. By systematically exploring the interplay between Data, Tensor, and Pipeline parallelism, we identify critical bottlenecks that defy standard scaling heuristics. Our analysis reveals that data parallelism is throughput-efficient for small models but hits a capacity trap on reasoning workloads, as KV-cache fragmentation forces early throttling resulting in…
