On Evaluating Performance of LLM Inference Serving Systems
Amey Agrawal, Nitin Kedia, Anmol Agarwal, Jayashree Mohan, Nipun Kwatra, Souvik Kundu, Ramachandran Ramjee, Alexey Tumanov

TL;DR
This paper critically examines current evaluation practices for LLM inference systems, identifies common anti-patterns that distort performance assessment, and proposes a comprehensive framework to improve evaluation robustness and relevance.
Contribution
The paper presents a systematic analysis of evaluation anti-patterns in LLM inference, provides a checklist for avoiding them, and demonstrates the checklist through a case study on speculative decoding.
Findings
Identification of key anti-patterns in evaluation methodologies
Development of a checklist to avoid evaluation pitfalls
Case study demonstrating improved evaluation practices
Abstract
The rapid evolution of Large Language Model (LLM) inference systems has yielded significant efficiency improvements. However, our systematic analysis reveals that current evaluation methodologies frequently exhibit fundamental flaws, often manifesting as common evaluation anti-patterns that obscure true performance characteristics and impede scientific progress. Through a comprehensive examination of recent systems, we identify recurring anti-patterns across three key dimensions: Baseline Fairness, Evaluation Setup, and Metric Design. These anti-patterns are uniquely problematic for LLM inference due to its dual-phase nature combining distinct prefill and decode operations, its handling of highly heterogeneous workloads, and its strict temporal requirements for interactive use. We demonstrate how common anti-patterns -- such as inadequate baseline comparisons that conflate engineering…
Taxonomy
Topics: Service-Oriented Architecture and Web Services · Distributed and Parallel Computing Systems
