Benchmarks as Microscopes: A Call for Model Metrology

Michael Saxon; Ari Holtzman; Peter West; William Yang Wang; Naomi Saphra

arXiv:2407.16711·cs.SE·July 31, 2024

TL;DR

This paper advocates for a new discipline called model metrology, which uses dynamic benchmarks to measure specific language model capabilities and to predict performance under deployment.

Contribution

It introduces model metrology as a new approach to benchmarking, one that measures specific capabilities with dynamic assessments and studies how to generate benchmarks that predict the performance of language models under deployment.

Findings

01

Static benchmarks saturate without providing confidence in deployment tolerances

02

Dynamic assessments provide more reliable capability measurement

03

Community building is essential for advancing model metrology

Abstract

Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, but developers nonetheless claim that their models have generalized traits such as reasoning or open-domain language understanding based on these flawed metrics. The science and practice of LMs requires a new approach to benchmarking which measures specific capabilities with dynamic assessments. To be confident in our metrics, we need a new discipline of model metrology -- one which focuses on how to generate benchmarks that predict performance under deployment. Motivated by our evaluation criteria, we outline how building a community of model metrology practitioners -- one focused on building tools and studying how to measure system capabilities -- is the best way to meet these needs to and add…
