Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks

Melissa Ailem; Katerina Marazopoulou; Charlotte Siska; James Bono

arXiv:2404.16966·cs.CL·June 7, 2024·1 cites

TL;DR

This paper examines the common assumption that a benchmark's test prompts are a random sample from a real-world distribution of interest. It finds that model performance is correlated across prompts, that accounting for these correlations can change model rankings, and that the correlations are partly explained by semantic similarity and common LLM failure points.

Contribution

The paper demonstrates that correlations across test prompts affect LLM evaluation outcomes, and it challenges the assumption that benchmark prompts are randomly sampled from the distribution of interest.

Findings

01 Performance correlation across prompts is non-random

02 Accounting for prompt correlations can alter model rankings

03 Semantic similarity and failure points explain prompt correlations

Abstract

Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance. This is consistent with the assumption that the test prompts within a benchmark represent a random sample from a real-world distribution of interest. We note that this is generally not the case; instead, we hold that the distribution of interest varies according to the specific use case. We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.
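To make finding (2) concrete, here is a minimal sketch of how weighting correlated prompts differently can reorder two models. It is not the authors' method: the scores, the cluster assignment, and the function names (`uniform_accuracy`, `cluster_weighted_accuracy`, `ranking`) are all hypothetical, and equal weight per cluster is only one of many possible reweighting schemes.

```python
from statistics import mean

# Hypothetical toy data: 1 = correct, 0 = incorrect, one entry per test prompt.
# Prompts 0-3 are assumed to be near-duplicates (one cluster of correlated
# prompts); prompts 4 and 5 are assumed to be unrelated to everything else.
scores = {
    "model_a": [1, 1, 1, 1, 0, 0],
    "model_b": [0, 0, 0, 0, 1, 1],
}
clusters = [0, 0, 0, 0, 1, 2]  # assumed cluster id for each prompt


def uniform_accuracy(row):
    """Plain benchmark average: every prompt counts equally."""
    return mean(row)


def cluster_weighted_accuracy(row, cluster_ids):
    """Average within each cluster first, then across clusters,
    so a group of correlated prompts counts as a single unit."""
    by_cluster = {}
    for score, cid in zip(row, cluster_ids):
        by_cluster.setdefault(cid, []).append(score)
    return mean(mean(group) for group in by_cluster.values())


def ranking(metric):
    """Model names sorted from best to worst under the given metric."""
    return sorted(scores, key=lambda name: metric(scores[name]), reverse=True)


uniform_rank = ranking(uniform_accuracy)
weighted_rank = ranking(lambda row: cluster_weighted_accuracy(row, clusters))

print("uniform: ", uniform_rank)
print("weighted:", weighted_rank)
```

In this toy setup `model_a` leads under the plain average because it solves the four near-duplicate prompts, while `model_b` leads once that cluster is counted once. Which weighting is appropriate depends on the use case, which is the point the abstract makes about the distribution of interest.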

Taxonomy

Topics: Infrastructure Maintenance and Monitoring · Efficiency Analysis Using DEA