Project MPG: towards a generalized performance benchmark for LLM capabilities
Lucas Spangher, Tianle Li, William F. Arnold, Nick Masiewicki, Xerxes Dotiwalla, Rama Parusmathi, Peter Grabowski, Eugene Ie, Dan Gruhl

TL;DR
Project MPG introduces a novel aggregation method for evaluating LLMs across diverse benchmarks, providing a comprehensive performance metric that combines accuracy and efficiency, aiding decision-making for non-experts.
Contribution
The paper proposes a generalizable scoring scheme for LLM benchmarking that summarizes accuracy and speed as two interpretable metrics, making models easier to compare.
Findings
High correlation with existing benchmarks like Chatbot Arena
Improves upon MMLU leaderboard correlation
Provides a unified performance and efficiency score
Abstract
There exists an extremely wide array of LLM benchmarking tasks, whereas oftentimes a single number is the most actionable for decision-making, especially by non-experts. No such aggregation schema exists that is not Elo-based, which could be costly or time-consuming. Here we propose a method to aggregate performance across a general space of benchmarks, nicknamed Project "MPG," dubbed Model Performance and Goodness, additionally referencing a metric widely understood to be an important yet inaccurate and crude measure of car performance. Here, we create two numbers: a "Goodness" number (answer accuracy) and a "Fastness" number (cost or QPS). We compare models against each other and present a ranking according to our general metric as well as subdomains. We find significant agreement between the raw Pearson correlation of our scores and those of Chatbot Arena, even improving on the…
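The abstract reports agreement with Chatbot Arena in terms of Pearson correlation. As a rough illustration of that comparison step only, the sketch below computes a Pearson coefficient between two lists of per-model scores. The function name `pearson` and the score values are hypothetical placeholders, not the paper's aggregation method or its results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only; these are NOT results from the paper.
aggregate_scores = [0.2, 0.5, 0.7, 0.9]        # hypothetical per-model aggregate scores
arena_scores = [1000.0, 1100.0, 1150.0, 1250.0]  # hypothetical per-model arena ratings

r = pearson(aggregate_scores, arena_scores)
print(f"Pearson r = {r:.3f}")
```

A coefficient near 1 would indicate that the two scoring schemes order and space the models similarly; the two score lists must refer to the same models in the same order.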
Taxonomy
Topics: Distributed and Parallel Computing Systems · Fault Detection and Control Systems · Advanced Data Storage Technologies
