Inference economics of language models

Ege Erdil

arXiv:2506.04645·cs.LG·June 6, 2025

TL;DR

This paper presents a theoretical model of the economic trade-off between cost per token and serial generation speed when deploying large language models, accounting for hardware constraints and parallelism strategies.

Contribution

The paper introduces a theoretical framework for LLM inference economics that optimizes over parallelism setups and batch sizes to maximize serial inference speed at a given cost per token.

Findings

01. Computed Pareto frontiers for popular LLMs showing trade-offs between speed and cost.

02. Identified optimal parallelism and batching strategies for different hardware constraints.

03. Provided insights into deploying LLMs efficiently at scale.

Abstract

We develop a theoretical model that addresses the economic trade-off between cost per token and serial token generation speed when deploying LLMs for inference at scale. Our model takes into account arithmetic, memory bandwidth, network bandwidth and latency constraints, and optimizes over different parallelism setups and batch sizes to find the ones that maximize serial inference speed at a given cost per token. We use the model to compute Pareto frontiers of serial speed versus cost per token for popular language models.
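To make the speed-versus-cost trade-off concrete, here is a minimal Python sketch of a roofline-style toy model. It is not the paper's model: it keeps only the arithmetic and memory-bandwidth constraints, ignores network bandwidth, latency, and parallelism, and assumes one accelerator. Every name and number in it (`Hardware`, `step_time`, `kv_bytes_per_seq`, the hardware figures, the price) is a hypothetical choice made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    """Hypothetical accelerator description; all values are illustrative."""
    flops_per_s: float       # peak arithmetic throughput
    mem_bytes_per_s: float   # memory bandwidth
    dollars_per_s: float     # rental price per second


def step_time(params, bytes_per_param, kv_bytes_per_seq, batch, hw):
    """Time for one decoding step: the slower of compute and memory traffic."""
    # Roughly 2 FLOPs per parameter per generated token (a common approximation).
    compute = 2.0 * params * batch / hw.flops_per_s
    # Weights are read once per step; each sequence also reads its own KV cache.
    memory = (params * bytes_per_param + batch * kv_bytes_per_seq) / hw.mem_bytes_per_s
    return max(compute, memory)


def operating_points(params, bytes_per_param, kv_bytes_per_seq, batches, hw):
    """Return (batch, serial tokens/s per sequence, dollars per token) tuples."""
    result = []
    for b in batches:
        t = step_time(params, bytes_per_param, kv_bytes_per_seq, b, hw)
        speed = 1.0 / t                    # each sequence gets one token per step
        cost = hw.dollars_per_s * t / b    # the step's cost is shared by b tokens
        result.append((b, speed, cost))
    return result


def pareto_frontier(candidates):
    """Keep points that no other point beats on both speed and cost."""
    frontier = []
    for p in candidates:
        dominated = any(
            q[1] >= p[1] and q[2] <= p[2] and (q[1] > p[1] or q[2] < p[2])
            for q in candidates
        )
        if not dominated:
            frontier.append(p)
    return frontier


# Illustrative inputs only; none of these figures come from the paper.
hw = Hardware(flops_per_s=1e15, mem_bytes_per_s=2e12, dollars_per_s=2.0 / 3600)
points = operating_points(
    params=70e9, bytes_per_param=1.0, kv_bytes_per_seq=1e8,
    batches=[1, 4, 16, 64, 256, 1024, 4096], hw=hw,
)
frontier = pareto_frontier(points)
for batch, speed, cost in frontier:
    print(f"batch={batch:5d}  speed={speed:8.2f} tok/s  cost=${cost:.3e}/token")
```

Under these assumptions, larger batches lower the cost per token but slow each individual sequence, and once a step becomes compute-bound, growing the batch further only reduces speed. The paper's model additionally optimizes over parallelism setups and accounts for network constraints, which this sketch leaves out.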

Code & Models

Repositories

ege-erdil/inference-economics (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Parallel Computing and Optimization Techniques · Machine Learning and Algorithms
