Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence

Timothy R. McIntosh, Teo Susnjak, Nalin Arachchilage, Tong Liu, Paul Watters, and Malka N. Halgamuge

arXiv:2402.09880·cs.AI·May 15, 2025·42 cites

Open Access

TL;DR

This paper critically evaluates 23 LLM benchmarks, revealing significant limitations and advocating for a shift towards dynamic, behavior-based evaluation methods to better capture LLMs' capabilities and risks.

Contribution

It introduces a novel unified evaluation framework and highlights the urgent need for standardized, ethical, and adaptive benchmarking methodologies for LLMs.

Findings

01. Identified biases and inconsistencies in current benchmarks

02. Highlighted challenges in measuring reasoning and adaptability

03. Emphasized the importance of cultural and ethical considerations

Abstract

The rapid rise in popularity of Large Language Models (LLMs) with emerging capabilities has spurred public curiosity to evaluate and compare different LLMs, leading many researchers to propose their own LLM benchmarks. Noticing preliminary inadequacies in those benchmarks, we embarked on a study to critically assess 23 state-of-the-art LLM benchmarks, using our novel unified evaluation framework through the lenses of people, process, and technology, under the pillars of benchmark functionality and integrity. Our research uncovered significant limitations, including biases, difficulties in measuring genuine reasoning, adaptability, implementation inconsistencies, prompt engineering complexity, evaluator diversity, and the overlooking of cultural and ideological norms in one comprehensive assessment. Our discussions emphasized the urgent need for standardized methodologies, regulatory…

Taxonomy

Topics: Topic Modeling