LIMINAL: Exploring The Frontiers of LLM Decode Performance
Michael Davies, Neal Crago, Karthikeyan Sankaralingam, Christos Kozyrakis

TL;DR
This paper introduces LIMINAL, an analytical model to explore the performance limits of LLM inference on various hardware, identifying key bottlenecks and challenges for future hardware and algorithmic improvements.
Contribution
The paper develops LIMINAL, a systematic performance model for LLM inference, and provides insights into hardware bottlenecks and future challenges for scaling LLM performance.
Findings
LIMINAL predicts LLM inference performance on existing hardware with a mean absolute error of 7.6%.
Memory bandwidth and capacity are primary bottlenecks.
Achieving >10,000 tokens/sec requires hardware and algorithmic advances.
Abstract
The rapid advancement of Large Language Models (LLMs) necessitates a deep understanding of their fundamental performance limits. This paper investigates the limits of LLM inference, focusing on hardware-imposed bottlenecks in auto-regressive decoding. We develop LIMINAL, an analytical performance model that abstracts application requirements and hardware capabilities to systematically explore performance and efficiency across a wide range of current, near-future, and hypothetical hardware. We find LIMINAL is accurate when compared to LLMs executing on existing hardware, achieving a mean absolute error of 7.6%. Our analysis spans from current HBM3 memory technology used in AI accelerators like GPUs and TPUs to systems based on advanced HBM4 and advanced 3D-stacked DRAM technology. We identify five non-negotiable challenges for LLM inference hardware, establishing compute, memory…
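To make the memory-bandwidth bottleneck concrete, here is a minimal roofline-style sketch. It is not the LIMINAL model, whose equations are not reproduced on this page. It assumes only that each decoded token must stream all model weights, plus any KV cache, from memory once. The function names and all numeric values are hypothetical and chosen for round arithmetic.

```python
def decode_tokens_per_sec_bound(param_count, bytes_per_param,
                                kv_cache_bytes, mem_bandwidth_bytes_per_sec):
    """Upper bound on single-sequence decode rate when memory-bandwidth bound.

    Assumption (not from the paper): every token reads all weights and the
    whole KV cache from memory exactly once, and compute is free.
    """
    bytes_per_token = param_count * bytes_per_param + kv_cache_bytes
    return mem_bandwidth_bytes_per_sec / bytes_per_token


def required_bandwidth(param_count, bytes_per_param,
                       kv_cache_bytes, target_tokens_per_sec):
    """Memory bandwidth (bytes/s) needed to hit a target decode rate
    under the same simplifying assumption."""
    bytes_per_token = param_count * bytes_per_param + kv_cache_bytes
    return bytes_per_token * target_tokens_per_sec


# Hypothetical configuration: 70e9 parameters at 2 bytes each, KV cache
# ignored, 3.5e12 bytes/s of memory bandwidth.
bound = decode_tokens_per_sec_bound(70e9, 2, 0, 3.5e12)

# Bandwidth this simple bound would demand for 10,000 tokens/sec.
needed = required_bandwidth(70e9, 2, 0, 10_000)

print(f"bandwidth-bound decode rate: {bound:.1f} tokens/s")
print(f"bandwidth for 10,000 tokens/s: {needed:.2e} bytes/s")
```

Under these assumed numbers the bound is about 25 tokens/s per sequence, and reaching 10,000 tokens/s would take roughly 1.4e15 bytes/s. That gap is consistent with the finding that such rates need both hardware and algorithmic advances. Batching, parallelism across devices, and compute limits are deliberately left out of this sketch.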
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
