Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
Xiang Liu, Shimiao Yuan, Zhenheng Tang, Peijie Dong, Kaiyong Zhao, Qiang Wang, Bo Li, Xiaowen Chu

TL;DR
The paper argues that large language model inference should be evaluated as energy-to-token production, accounting for physical constraints such as delivered power and cooling alongside traditional metrics like accuracy, latency, and throughput.
Contribution
The paper formalizes the energy-to-token production framework and calls for new benchmarking standards that report joules per token and power-utilization metrics.
Findings
Inference constraints shift from compute to power and cooling at scale.
System optimizations can be viewed as energy-to-token levers.
Benchmarking should include energy and efficiency metrics alongside accuracy.
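As a rough illustration of the joules-per-token metric named above, the following sketch estimates it from sampled power readings taken during a benchmark run. The function name, its parameters, and the rectangle-rule integration are assumptions made for illustration; the summary does not specify the paper's measurement protocol.

```python
from typing import Sequence


def joules_per_token(
    power_samples_w: Sequence[float],
    sample_interval_s: float,
    tokens_generated: int,
    pue: float = 1.0,
) -> float:
    """Estimate energy per generated token (hypothetical helper).

    power_samples_w: IT power readings in watts, taken at a fixed interval.
    sample_interval_s: seconds between consecutive readings.
    tokens_generated: tokens produced during the measured window.
    pue: power usage effectiveness (facility power / IT power); 1.0 means
         only IT energy is counted.
    """
    if tokens_generated <= 0:
        raise ValueError("tokens_generated must be positive")
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    # Rectangle-rule integration: watts * seconds = joules of IT energy.
    it_energy_j = sum(power_samples_w) * sample_interval_s
    # Scale IT energy to facility energy, then divide by token count.
    return it_energy_j * pue / tokens_generated
```

For example, three one-second readings of 300 W, 310 W, and 290 W over a window that produced 450 tokens give 900 J of IT energy, or 2 J per token before any PUE adjustment.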
Abstract
LLM inference is still evaluated mainly as a model or software problem: accuracy, latency, throughput, and hardware utilization. This is incomplete. At deployment scale, the relevant output is a quality-conditioned token produced under joint constraints from effective compute, delivered data-center power, cooling capacity, PUE, and utilization. We argue that the ML community should treat inference as energy-to-token production. We formalize this view with a dimensionally consistent Token Production Function in which token rate is bounded by both compute-per-token and energy-per-token ceilings. Listed API prices vary by over an order of magnitude across providers, but we use price dispersion only as directional motivation, not as causal evidence of marginal cost. The core physical question is instead: under fixed quality and service targets, when does the binding constraint move…
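The idea of a token rate bounded by both a compute ceiling and an energy ceiling can be sketched as a minimum of two rates. This is not the paper's Token Production Function, whose exact form is not given in the abstract; it is a minimal sketch under assumed inputs, and every name below is hypothetical. The units are chosen so both ceilings come out in tokens per second: (FLOP/s) / (FLOP/token) and (J/s) / (J/token).

```python
from typing import Tuple


def token_rate_ceiling(
    effective_flops_per_s: float,
    flops_per_token: float,
    facility_power_w: float,
    pue: float,
    joules_per_token: float,
) -> Tuple[float, str]:
    """Return (max tokens/s, name of the binding constraint).

    Illustrative only: assumes the two ceilings are independent and that
    PUE converts facility power to IT power (IT power = facility / PUE).
    """
    # Compute ceiling: (FLOP/s) / (FLOP/token) = tokens/s.
    compute_ceiling = effective_flops_per_s / flops_per_token
    # Power actually delivered to IT equipment, in watts (J/s).
    it_power_w = facility_power_w / pue
    # Energy ceiling: (J/s) / (J/token) = tokens/s.
    energy_ceiling = it_power_w / joules_per_token
    if compute_ceiling <= energy_ceiling:
        return compute_ceiling, "compute"
    return energy_ceiling, "energy"
```

With 1e15 effective FLOP/s, 1e12 FLOP per token, 1 MW of facility power, a PUE of 1.25, and 2 J per token, the compute ceiling (1,000 tokens/s) binds well before the energy ceiling (400,000 tokens/s). Shrinking the power budget or raising the energy cost per token moves the binding constraint to energy.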
