A Systematic Characterization of LLM Inference on GPUs
Haonan Wang, Xuxin Xiao, Mingyu Yan, Zhuoyuan Zhu, Dengke Han, Duo Wang, Wenming Li, Xiaochun Ye, Cunchen Hu, Hongyang Chen, Guangyu Sun

TL;DR
This paper systematically characterizes LLM inference on GPUs through an analytical framework that covers the hardware root causes of observed performance behavior, system scaling principles, and emerging paradigms, and it offers empirical insights and optimization guidance.
Contribution
The paper introduces a four-dimensional framework for understanding LLM inference on GPUs, combining empirical observation with hardware-level analysis and an exploration of emerging paradigms.
Findings
Identifies performance phenomena in LLM inference
Reveals hardware root causes affecting performance
Provides practical optimization strategies
Abstract
This work presents a systematic characterization of Large Language Model (LLM) inference to address fragmented understanding. Through comprehensive experiments, we establish a four-dimensional analytical framework: (1) Two-Phase Heterogeneity Observation; (2) Microarchitectural Root Cause Analysis; (3) System Scaling Principles; and (4) Emerging Paradigm Boundaries. Our investigation progresses systematically from observation to foresight: identifying performance phenomena, revealing hardware causes, validating system behavior, and exploring new paradigms. This study not only consolidates a reliable empirical foundation for existing research but also provides new discoveries and practical optimization guidance for LLM inference.
Taxonomy
Topics: Machine Learning in Materials Science · Big Data and Digital Economy · Advanced Neural Network Applications
