Statistical Uncertainty Quantification for Aggregate Performance Metrics in Machine Learning Benchmarks

Rachel Longjohn; Giri Gopalan; Emily Casleton

arXiv:2501.04234·stat.ML·January 9, 2025

TL;DR

This paper demonstrates how statistical methods such as bootstrapping and Bayesian hierarchical modeling can quantify uncertainty in performance metrics aggregated across multiple machine learning tasks, giving a more realistic view of model performance.

Contribution

It demonstrates how statistical uncertainty quantification techniques can be applied to aggregate ML benchmark metrics, so that summary scores are reported with their uncertainty rather than as bare point estimates.

Findings

01. Uncertainty quantification reveals model dominance in specific tasks.

02. Bayesian models provide credible intervals for performance metrics.

03. Visualization of task weightings helps interpret model performance.

Abstract

Modern artificial intelligence is supported by machine learning models (e.g., foundation models) that are pretrained on a massive data corpus and then adapted to solve a variety of downstream tasks. To summarize performance across multiple tasks, evaluation metrics are often aggregated into a summary metric, e.g., average accuracy across 10 question-answering tasks. When aggregating evaluation metrics, it is useful to incorporate uncertainty in the aggregate metric in order to gain a more realistic understanding of model performance. Our objective in this work is to demonstrate how statistical methodology can be used for quantifying uncertainty in metrics that have been aggregated across multiple tasks. The methods we emphasize are bootstrapping, Bayesian hierarchical (i.e., multilevel) modeling, and the visualization of task weightings that consider standard errors. These techniques…
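
To make the bootstrapping idea in the abstract concrete, here is a minimal sketch of a percentile bootstrap interval for an unweighted average accuracy across tasks. It is an illustration under stated assumptions, not the authors' procedure: the function name `bootstrap_mean_accuracy`, the synthetic 0/1 outcomes, and the choice to resample examples within each task (treating the set of tasks as fixed) are all hypothetical.

```python
import random


def bootstrap_mean_accuracy(task_outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the unweighted mean of per-task accuracies.

    task_outcomes: list of lists of 0/1 correctness indicators, one list per task.
    Assumption: examples are resampled within each task; the tasks themselves
    are treated as fixed, not as a sample from a larger population of tasks.
    """
    rng = random.Random(seed)

    def aggregate(outcome_lists):
        # Per-task accuracy first, then a plain average across tasks.
        accuracies = [sum(o) / len(o) for o in outcome_lists]
        return sum(accuracies) / len(accuracies)

    point = aggregate(task_outcomes)

    replicates = []
    for _ in range(n_boot):
        # Resample each task's examples with replacement, keeping its size.
        resampled = [rng.choices(o, k=len(o)) for o in task_outcomes]
        replicates.append(aggregate(resampled))
    replicates.sort()

    # Percentile interval from the sorted bootstrap replicates.
    lo = replicates[int((alpha / 2) * n_boot)]
    hi = replicates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lo, hi


# Synthetic example: three tasks of 200 examples with accuracies 0.8, 0.6, 0.7.
tasks = [
    [1] * 160 + [0] * 40,
    [1] * 120 + [0] * 80,
    [1] * 140 + [0] * 60,
]
point, lo, hi = bootstrap_mean_accuracy(tasks)
print(f"mean accuracy = {point:.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```

Resampling within tasks captures only the sampling variability of the test examples. If the tasks are themselves viewed as a draw from a wider population, resampling tasks as well, or a hierarchical model of the kind the abstract mentions, would be the more natural choice.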

Taxonomy

Topics: Fault Detection and Control Systems · Software Reliability and Analysis Research