Economic Evaluation of LLMs

Michael J. Zellinger; Matt Thomson

arXiv:2507.03834·cs.AI·July 8, 2025

TL;DR

This paper introduces an economic evaluation framework for LLMs that reduces a model's performance trade-offs to a single dollar-denominated number tied to a concrete use case, so that models with different strengths and weaknesses can be compared directly.

Contribution

It proposes an economic evaluation method for LLMs that prices three factors in dollars: the cost of making a mistake, the cost of incremental latency, and the cost of abstaining from a query.

Findings

01

On difficult questions from the MATH benchmark, reasoning models offer better accuracy-cost trade-offs than non-reasoning models once the cost of a mistake exceeds $0.01.

02

Large LLMs often outperform cascades at mistake costs as low as $0.1.

03

Practitioners should prioritize powerful models over cost minimization in real-world tasks.
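
The dollar thresholds in these findings can be read as break-even points. The sketch below is an illustration under a simplifying assumption that is not taken from the paper: the expected cost per query is the inference cost plus the error rate times the cost of a mistake, with latency and abstention ignored. The function name and all numbers are hypothetical.

```python
def break_even_mistake_cost(cheap_cost, cheap_error, strong_cost, strong_error):
    """Mistake cost (in dollars) above which the stronger model is cheaper overall.

    Assumed cost model (an illustration, not the paper's definition):
        expected_cost = inference_cost + error_rate * mistake_cost
    Setting the two models' expected costs equal and solving for mistake_cost
    gives (strong_cost - cheap_cost) / (cheap_error - strong_error).
    """
    if cheap_error <= strong_error:
        raise ValueError("the stronger model must have the lower error rate")
    return (strong_cost - cheap_cost) / (cheap_error - strong_error)


# Hypothetical per-query figures, chosen only to show the arithmetic.
threshold = break_even_mistake_cost(
    cheap_cost=0.001, cheap_error=0.30,   # cheap, error-prone model
    strong_cost=0.02, strong_error=0.05,  # pricey but accurate model
)
print(f"Stronger model wins once a mistake costs more than ${threshold:.3f}")
```

Under this assumed model, the more the two error rates differ, the lower the mistake cost at which the more expensive model becomes the better choice.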

Abstract

Practitioners often navigate LLM performance trade-offs by plotting Pareto frontiers of optimal accuracy-cost trade-offs. However, this approach offers no way to compare LLMs with distinct strengths and weaknesses: for example, a cheap, error-prone model vs. a pricey but accurate one. To address this gap, we propose economic evaluation of LLMs. Our framework quantifies the performance trade-off of an LLM as a single number based on the economic constraints of a concrete use case, all expressed in dollars: the cost of making a mistake, the cost of incremental latency, and the cost of abstaining from a query. We apply our economic evaluation framework to compare the performance of reasoning and non-reasoning models on difficult questions from the MATH benchmark, discovering that reasoning models offer better accuracy-cost trade-offs as soon as the economic cost of a mistake exceeds…
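
To make the "single number" idea concrete, here is a minimal sketch that combines the three dollar-denominated costs named in the abstract with the per-query inference cost. The additive formula, the `ModelStats` fields, and every number are assumptions made for illustration; the paper's exact definition may differ.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    """Hypothetical per-query measurements for one model on one task."""
    name: str
    inference_cost: float   # dollars of API or compute spend per query
    error_rate: float       # fraction of queries answered incorrectly
    abstention_rate: float  # fraction of queries the model declines to answer
    latency_s: float        # average seconds per query


def expected_cost(m, mistake_cost, latency_cost_per_s, abstention_cost):
    """Expected dollars lost per query under an assumed additive cost model."""
    return (
        m.inference_cost
        + m.error_rate * mistake_cost
        + m.abstention_rate * abstention_cost
        + m.latency_s * latency_cost_per_s
    )


def best_model(models, mistake_cost, latency_cost_per_s, abstention_cost):
    """Return the model with the lowest expected cost for this use case."""
    return min(
        models,
        key=lambda m: expected_cost(
            m, mistake_cost, latency_cost_per_s, abstention_cost
        ),
    )


# Hypothetical models: a cheap, error-prone one and a pricey but accurate one.
cheap = ModelStats("cheap", inference_cost=0.001, error_rate=0.30,
                   abstention_rate=0.0, latency_s=1.0)
accurate = ModelStats("accurate", inference_cost=0.02, error_rate=0.05,
                      abstention_rate=0.0, latency_s=10.0)

for mistake_cost in (0.01, 1.00):
    winner = best_model([cheap, accurate], mistake_cost,
                        latency_cost_per_s=0.0, abstention_cost=0.0)
    print(f"mistake cost ${mistake_cost:.2f}: prefer {winner.name}")
```

Because every term is in dollars, the ranking of the two models can flip as the use case changes, which is something an accuracy-cost Pareto plot alone does not resolve.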
