Benchmarks as Microscopes: A Call for Model Metrology
Michael Saxon, Ari Holtzman, Peter West, William Yang Wang, Naomi Saphra

TL;DR
This paper advocates for a new discipline called model metrology, emphasizing dynamic benchmarks to accurately assess language model capabilities and ensure reliable deployment performance.
Contribution
The paper introduces model metrology, a proposed discipline for benchmarking that uses dynamic assessments to better predict the real-world performance of language models.
Findings
Static benchmarks saturate and lack deployment confidence
Dynamic assessments provide more reliable capability measurement
Community building is essential for advancing model metrology
Abstract
Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, but developers nonetheless claim that their models have generalized traits such as reasoning or open-domain language understanding based on these flawed metrics. The science and practice of LMs requires a new approach to benchmarking which measures specific capabilities with dynamic assessments. To be confident in our metrics, we need a new discipline of model metrology -- one which focuses on how to generate benchmarks that predict performance under deployment. Motivated by our evaluation criteria, we outline how building a community of model metrology practitioners -- one focused on building tools and studying how to measure system capabilities -- is the best way to meet these needs to and add…
