Metrology for AI: From Benchmarks to Instruments
Chris Welty, Praveen Paritosh, Lora Aroyo

TL;DR
This paper advocates applying metrology principles to AI system evaluation, arguing that benchmarks should be treated as measurement instruments whose precision and resolution are reported alongside the results they produce.
Contribution
The paper proposes using metrology to formalize AI evaluation, highlighting the need to report the precision (measurement variance) and resolution (smallest detectable change) of the instrument behind each benchmark result.
Findings
Crowd-sourced datasets are commonly used as measurement instruments.
Current AI benchmarks often lack reporting of measurement variance.
Applying metrology can improve the reliability of AI performance comparisons.
Abstract
In this paper we present the first steps towards hardening the science of measuring AI systems, by adopting metrology, the science of measurement and its application, and applying it to human (crowd) powered evaluations. We begin with the intuitive observation that evaluating the performance of an AI system is a form of measurement. In all other science and engineering disciplines, the devices used to measure are called instruments, and all measurements are recorded with respect to the characteristics of the instruments used. One does not report mass, speed, or length, for example, of a studied object without disclosing the precision (measurement variance) and resolution (smallest detectable change) of the instrument used. It is extremely common in the AI literature to compare the performance of two systems by using a crowd-sourced dataset as an instrument, but failing to report if the…
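To make the precision and resolution idea concrete, here is a minimal sketch, not the authors' method. It assumes per-item 0/1 correctness scores for each system and uses bootstrap resampling of the test items as a stand-in for measurement variance; the function names, the bootstrap itself, and the threshold factor k are all illustrative assumptions.

```python
import random
import statistics


def bootstrap_accuracy_std(correct, n_resamples=1000, seed=0):
    """Estimate the spread of an accuracy measurement.

    `correct` is a list of 0/1 per-item scores (hypothetical input format).
    Items are resampled with replacement, and the standard deviation of the
    resampled accuracies is returned as a rough proxy for instrument precision.
    """
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    return statistics.stdev(means)


def difference_is_detectable(correct_a, correct_b, k=2.0, n_resamples=1000, seed=0):
    """Return True if the accuracy gap exceeds k times the larger spread.

    The factor k is an assumed threshold, not a value from the paper.
    """
    acc_a = sum(correct_a) / len(correct_a)
    acc_b = sum(correct_b) / len(correct_b)
    spread = max(
        bootstrap_accuracy_std(correct_a, n_resamples, seed),
        bootstrap_accuracy_std(correct_b, n_resamples, seed),
    )
    return abs(acc_a - acc_b) > k * spread
```

Under these assumptions, a gap between two systems that falls below the estimated spread would be reported as not detectable by the instrument rather than as an improvement. Variance arising from the crowd annotators themselves, which the abstract emphasizes, would need repeated annotation data and is not modeled here.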
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Anomaly Detection Techniques and Applications · Data Stream Mining Techniques
