Inference economics of language models
Ege Erdil

TL;DR
This paper presents a theoretical model of the economic trade-offs in deploying large language models, balancing serial generation speed against cost per token under hardware constraints and across parallelism strategies.
Contribution
It introduces a theoretical framework for the inference economics of LLMs that optimizes over parallelism setups and batch sizes to maximize serial inference speed at a given cost per token.
Findings
Computed Pareto frontiers of serial speed versus cost per token for popular LLMs.
Identified the parallelism setups and batch sizes that optimize serial speed at a given cost per token under arithmetic, memory bandwidth, network bandwidth, and latency constraints.
Provided insights into deploying LLMs efficiently at scale.
Abstract
We develop a theoretical model that addresses the economic trade-off between cost per token and serial token generation speed when deploying LLMs for inference at scale. Our model takes into account arithmetic, memory bandwidth, network bandwidth, and latency constraints, and optimizes over different parallelism setups and batch sizes to find the ones that maximize serial inference speed at a given cost per token. We use the model to compute Pareto frontiers of serial speed versus cost per token for popular language models.
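To make the speed-versus-cost trade-off concrete, the following is a minimal roofline-style sketch in Python. It is not the paper's model: it assumes a single device, considers only the arithmetic and memory bandwidth constraints, and ignores network bandwidth, latency, parallelism, and KV-cache traffic. The names `Hardware`, `step_time`, `operating_points`, and `pareto_frontier`, as well as the approximation of 2 FLOP per parameter per token, are assumptions introduced for illustration, and any hardware numbers passed in are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    """Hypothetical single-device description (all values are assumptions)."""
    flops_per_s: float         # peak arithmetic throughput, FLOP/s
    mem_bw_bytes_per_s: float  # memory bandwidth, bytes/s
    cost_per_s: float          # rental cost, dollars per second


def step_time(n_params, bytes_per_param, batch, hw):
    """Time for one decoding step of a whole batch, in seconds.

    Assumes ~2 FLOP per parameter per token and that the weights are
    read from memory once per step, regardless of batch size.
    """
    compute_time = 2.0 * n_params * batch / hw.flops_per_s
    memory_time = n_params * bytes_per_param / hw.mem_bw_bytes_per_s
    # The step is limited by whichever resource is the bottleneck.
    return max(compute_time, memory_time)


def operating_points(n_params, bytes_per_param, hw, batches):
    """Return (batch, serial tokens/s per sequence, dollars per token)."""
    points = []
    for b in batches:
        t = step_time(n_params, bytes_per_param, b, hw)
        serial_speed = 1.0 / t              # each sequence gets one token per step
        cost_per_token = hw.cost_per_s * t / b  # step cost shared by b tokens
        points.append((b, serial_speed, cost_per_token))
    return points


def pareto_frontier(points):
    """Keep points that no other point beats on both speed and cost."""
    frontier = []
    for p in points:
        dominated = any(
            q[1] >= p[1] and q[2] <= p[2] and (q[1] > p[1] or q[2] < p[2])
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


if __name__ == "__main__":
    # Hypothetical accelerator and model size, chosen only for illustration.
    hw = Hardware(flops_per_s=1e15, mem_bw_bytes_per_s=1e12, cost_per_s=2.0 / 3600)
    pts = operating_points(1e10, 2, hw, [1, 8, 64, 512, 4096])
    for batch, speed, cost in pareto_frontier(pts):
        print(f"batch={batch:5d}  speed={speed:7.2f} tok/s  cost=${cost:.2e}/token")
```

In this simplified setting, small batches are memory-bound, so growing the batch lowers cost per token without slowing each sequence; once the step becomes compute-bound, larger batches slow serial generation, and further savings in cost per token are no longer free. A fuller treatment along the lines the abstract describes would also search over parallelism setups and account for network bandwidth and latency.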
Taxonomy
Topics: Natural Language Processing Techniques · Parallel Computing and Optimization Techniques · Machine Learning and Algorithms
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
