The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance

Kyle Moore; Jesse Roberts; Thao Pham; Oseremhen Ewaleifoh; Doug Fisher

arXiv:2406.11634·cs.CL·October 1, 2024

Open Access

TL;DR

This paper investigates how base-rate probabilities influence large language model benchmark performance, revealing that test-taking strategies can confound true task ability measurements, and proposes a new task to better distinguish these factors.

Contribution

The study identifies the impact of base-rate effects on LLM benchmark results and introduces the Nvr-X-MMLU task to separate test-taking strategies from genuine task performance.

Findings

01. Base-rate differences significantly affect LLM test performance.

02. Counterfactual prompting mitigates the base-rate effect.

03. The Nvr-X-MMLU task disambiguates test-taking ability from task performance.

Abstract

Cloze testing is a common method for measuring the behavior of large language models on a number of benchmark tasks. Using the MMLU dataset, we show that the base-rate probability (BRP) differences across answer tokens are significant and affect task performance (i.e., guess A if uncertain). We find that counterfactual prompting does sufficiently mitigate the BRP effect. The BRP effect is found to have a similar effect to test-taking strategies employed by humans, leading to the conflation of task performance and test-taking ability. We propose the Nvr-X-MMLU task, a variation of MMLU, which helps to disambiguate test-taking ability from task performance and reports the latter.
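The "guess A if uncertain" behavior described in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the authors' method: it assumes the score of each answer token is the sum of a task-evidence term and a fixed base-rate term, and the function `cloze_answer` and every number in it are hypothetical.

```python
import math

LABELS = ["A", "B", "C", "D"]


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cloze_answer(evidence_logits, base_rate_logits):
    """Cloze-style scoring: pick the answer label with the highest probability.

    Assumption (illustrative only): the score of each label token is the sum
    of a task-evidence logit and a fixed base-rate logit for that token.
    """
    combined = [e + b for e, b in zip(evidence_logits, base_rate_logits)]
    probs = softmax(combined)
    best = max(range(len(LABELS)), key=lambda i: probs[i])
    return LABELS[best], probs


# Hypothetical base-rate preference for the token "A" (made-up value).
BASE_RATE = [0.5, 0.0, 0.0, 0.0]

# No task evidence at all: the base rate alone decides the answer.
uncertain_choice, uncertain_probs = cloze_answer([0.0, 0.0, 0.0, 0.0], BASE_RATE)

# Strong task evidence for "C": the evidence outweighs the base rate.
confident_choice, confident_probs = cloze_answer([0.0, 0.0, 3.0, 0.0], BASE_RATE)

print(uncertain_choice, confident_choice)
```

In this toy setting, an uncertain model is scored as correct whenever the gold answer happens to sit on the favored label and as incorrect otherwise, which is one way a base-rate preference can be mistaken for task ability.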

Taxonomy

Topics: Digital Rights Management and Security · Innovative Microfluidic and Catalytic Techniques Innovation · Evolutionary Algorithms and Applications