Quantifying Variance in Evaluation Benchmarks

Lovish Madaan; Aaditya K. Singh; Rylan Schaeffer; Andrew Poulton; Sanmi Koyejo; Pontus Stenetorp; Sharan Narang; Dieuwke Hupkes

arXiv:2406.10229·cs.LG·June 17, 2024·1 cites


Open Access·3 Reviews

TL;DR

This paper investigates the sources and magnitude of variance in evaluation benchmarks for large language models, providing empirical measures, analysis of variance reduction techniques, and practical recommendations for more reliable model comparisons.

Contribution

It introduces metrics for quantifying benchmark variance, evaluates variance across models, and explores methods to reduce variance, aiding more accurate model evaluation.

Findings

01

Simple framing changes can reduce variance in smaller models.

02

Item analysis techniques have limited success in variance reduction.

03

Variance considerations are crucial for fair model comparisons.

Abstract

Evaluation benchmarks are the cornerstone of measuring capabilities of large language models (LLMs), as well as driving progress in said capabilities. Originally designed to make claims about capabilities (or lack thereof) in fully pretrained models, evaluation benchmarks are now also extensively used to decide between various training choices. Despite this widespread usage, we rarely quantify the variance in our evaluation benchmarks, which dictates whether differences in performance are meaningful. Here, we define and measure a range of metrics geared towards measuring variance in evaluation benchmarks, including seed variance across initialisations, and monotonicity during training. By studying a large number of models -- both openly available and pretrained from scratch -- we provide empirical estimates for a variety of variance metrics, with considerations and recommendations for…
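The two metrics named in the abstract, seed variance and monotonicity during training, can be illustrated with a minimal sketch. It rests on assumptions that are not confirmed by this page: seed variance is taken as the sample standard deviation of final scores across seeds, and monotonicity as Kendall's tau between checkpoint order and score. The function names and the example scores are hypothetical.

```python
import statistics


def seed_std(final_scores):
    """Sample standard deviation of final benchmark scores across seeds.

    Assumption: one final score per training seed.
    """
    return statistics.stdev(final_scores)


def monotonicity(scores_by_checkpoint):
    """Kendall's tau-a between checkpoint order and benchmark score.

    Returns 1.0 for a strictly increasing curve and -1.0 for a strictly
    decreasing one. Assumption: this stands in for the paper's
    monotonicity metric, which this page does not define.
    """
    n = len(scores_by_checkpoint)
    if n < 2:
        raise ValueError("need at least two checkpoints")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = scores_by_checkpoint[j] - scores_by_checkpoint[i]
            if diff > 0:
                concordant += 1
            elif diff < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical accuracies, for illustration only.
scores_across_seeds = [0.412, 0.398, 0.421, 0.405]
scores_during_training = [0.25, 0.31, 0.29, 0.36, 0.40]

print("seed std:", seed_std(scores_across_seeds))
print("monotonicity:", monotonicity(scores_during_training))
```

Under these assumptions, a difference between two models that is smaller than the seed standard deviation would not by itself indicate a meaningful gap, and a monotonicity value well below 1 would suggest a noisy signal during training.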

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 3·Confidence 3

Strengths

The paper is clearly written and easy to understand.

Weaknesses

The key problem for me is that I do not see the value proposition of this work. It is difficult to see how it could become relevant to, or have an impact on, the evaluation of foundation models. Since this is why I am hesitant to support the paper, I focus my review and the questions below entirely on this point.

Reviewer 02·Rating 5·Confidence 3

Strengths

- The paper explores how to make evaluations more precise by reporting variance.
- Provides estimates for the expected variance across several benchmarks and models.
- An important finding is made on the unreliability of IRT-based methods for evaluation comparisons across models. This is very relevant for evaluation reporting.

Weaknesses

- The framing of ‘variance’ in the paper seems too broad. There are other possible kinds of variance worth exploring or mentioning.
- The title of the paper and general framing suggests a general focus on variance in evaluations, but the paper currently fails to contextualize two very distinct types of variance: training and inference. For example, the (training) seed variance discussed falls within training. Other possible sources of variance for each should be mentioned where possible. A basic

Reviewer 03·Rating 6·Confidence 3

Strengths

- The paper presents a timely and important topic of variance in evaluation benchmarks, that should be widely considered while reporting benchmark performance.
- The paper is cogently written and presents convincing demonstrations of the importance of considering variance in evaluations while doing model comparisons.
- The paper showcases results and cautions against using sample efficient benchmarking methods while doing model pretraining, since they are likely to provide a higher-variance sig

Weaknesses

- The provided variance numbers in Tab. 1, while being important as a reference, cannot be used directly for making comparisons across different model scales or training durations since it is not clear how those numbers would change with those factors, and whether we’d expect larger or smaller deviations in performance.
- Some important empirical details are missing, for example, could you provide more details on how you compute the SNRs for both discrete and continuous? This is important since


Taxonomy

Topics·Evaluation and Performance Assessment