
TL;DR
This paper introduces an economic evaluation framework that reduces an LLM's performance trade-off to a single dollar-denominated number for a concrete use case, so that models with different strengths and weaknesses can be compared directly.
Contribution
It proposes an economic evaluation method for LLMs that prices three factors in dollars (the cost of a mistake, the cost of incremental latency, and the cost of abstaining from a query) to support model deployment decisions.
Findings
Reasoning models outperform non-reasoning ones on difficult MATH benchmark questions once the cost of a mistake exceeds $0.01.
Large LLMs often outperform cascades at mistake costs as low as $0.1.
Practitioners should prioritize powerful models over cost minimization in real-world tasks.
Abstract
Practitioners often navigate LLM performance trade-offs by plotting Pareto frontiers of optimal accuracy-cost trade-offs. However, this approach offers no way to compare between LLMs with distinct strengths and weaknesses: for example, a cheap, error-prone model vs a pricey but accurate one. To address this gap, we propose economic evaluation of LLMs. Our framework quantifies the performance trade-off of an LLM as a single number based on the economic constraints of a concrete use case, all expressed in dollars: the cost of making a mistake, the cost of incremental latency, and the cost of abstaining from a query. We apply our economic evaluation framework to compare the performance of reasoning and non-reasoning models on difficult questions from the MATH benchmark, discovering that reasoning models offer better accuracy-cost tradeoffs as soon as the economic cost of a mistake exceeds…
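To make the idea of a single dollar-denominated score concrete, here is a minimal Python sketch. It assumes the three costs combine additively into an expected cost per query, together with the direct spend on the model; the paper's exact definition may differ. The names `ModelStats`, `expected_cost`, and `break_even_mistake_cost` are hypothetical, and the two example models use made-up numbers rather than results from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelStats:
    """Hypothetical per-query statistics for one model (illustrative only)."""
    error_rate: float    # fraction of answered queries that are wrong
    abstain_rate: float  # fraction of queries the model declines to answer
    latency_s: float     # mean latency per query, in seconds
    api_cost: float      # mean direct spend per query, in dollars


def mistake_probability(m: ModelStats) -> float:
    """Probability that a query is answered and the answer is wrong."""
    return (1.0 - m.abstain_rate) * m.error_rate


def expected_cost(m: ModelStats, mistake_cost: float,
                  latency_cost_per_s: float, abstain_cost: float) -> float:
    """Expected dollars per query under an assumed additive cost model."""
    return (m.api_cost
            + latency_cost_per_s * m.latency_s
            + abstain_cost * m.abstain_rate
            + mistake_cost * mistake_probability(m))


def break_even_mistake_cost(cheap: ModelStats, strong: ModelStats,
                            latency_cost_per_s: float,
                            abstain_cost: float) -> Optional[float]:
    """Mistake cost above which `strong` has the lower expected cost.

    Expected cost is linear in the mistake cost, so the two lines cross at
    most once. Returns None if `strong` does not make fewer mistakes.
    """
    gap = mistake_probability(cheap) - mistake_probability(strong)
    if gap <= 0:
        return None
    base_cheap = expected_cost(cheap, 0.0, latency_cost_per_s, abstain_cost)
    base_strong = expected_cost(strong, 0.0, latency_cost_per_s, abstain_cost)
    return (base_strong - base_cheap) / gap


# Made-up numbers: a cheap, error-prone model vs a pricey but accurate one.
cheap = ModelStats(error_rate=0.50, abstain_rate=0.0, latency_s=1.0, api_cost=0.002)
strong = ModelStats(error_rate=0.25, abstain_rate=0.0, latency_s=8.0, api_cost=0.012)

threshold = break_even_mistake_cost(cheap, strong,
                                    latency_cost_per_s=0.0, abstain_cost=0.0)
print(f"strong model wins once a mistake costs more than ${threshold:.3f}")
```

Under this assumed model, raising the price of latency pushes the break-even point up, while a larger accuracy gap between the two models pulls it down.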
