On Evaluating Performance of LLM Inference Serving Systems

Amey Agrawal; Nitin Kedia; Anmol Agarwal; Jayashree Mohan; Nipun Kwatra; Souvik Kundu; Ramachandran Ramjee; Alexey Tumanov

arXiv:2507.09019 · cs.LG · July 15, 2025

Open Access

TL;DR

This paper critically examines current evaluation practices for LLM inference systems, identifies common anti-patterns that distort performance assessment, and proposes a comprehensive framework to improve evaluation robustness and relevance.

Contribution

The paper presents a systematic analysis of evaluation anti-patterns in LLM inference, provides a checklist for avoiding them, and demonstrates the checklist through a case study on speculative decoding.

Findings

01. Identification of key anti-patterns in evaluation methodologies

02. Development of a checklist to avoid evaluation pitfalls

03. Case study demonstrating improved evaluation practices

Abstract

The rapid evolution of Large Language Model (LLM) inference systems has yielded significant efficiency improvements. However, our systematic analysis reveals that current evaluation methodologies frequently exhibit fundamental flaws, often manifesting as common evaluation anti-patterns that obscure true performance characteristics and impede scientific progress. Through a comprehensive examination of recent systems, we identify recurring anti-patterns across three key dimensions: Baseline Fairness, Evaluation Setup, and Metric Design. These anti-patterns are uniquely problematic for LLM inference due to its dual-phase nature combining distinct prefill and decode operations, its handling of highly heterogeneous workloads, and its strict temporal requirements for interactive use. We demonstrate how common anti-patterns -- such as inadequate baseline comparisons that conflate engineering…

Taxonomy

Topics: Service-Oriented Architecture and Web Services · Distributed and Parallel Computing Systems