Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence
Timothy R. McIntosh, Teo Susnjak, Nalin Arachchilage, Tong Liu, Paul Watters, and Malka N. Halgamuge

TL;DR
This paper critically evaluates 23 LLM benchmarks, revealing significant limitations and advocating for a shift towards dynamic, behavior-based evaluation methods to better capture LLMs' capabilities and risks.
Contribution
It introduces a novel unified evaluation framework and highlights the urgent need for standardized, ethical, and adaptive benchmarking methodologies for LLMs.
Findings
Identified biases and inconsistencies in current benchmarks
Highlighted challenges in measuring reasoning and adaptability
Emphasized the importance of cultural and ethical considerations
Abstract
The rapid rise in popularity of Large Language Models (LLMs) with emerging capabilities has spurred public curiosity to evaluate and compare different LLMs, leading many researchers to propose their own LLM benchmarks. Noticing preliminary inadequacies in those benchmarks, we embarked on a study to critically assess 23 state-of-the-art LLM benchmarks, using our novel unified evaluation framework through the lenses of people, process, and technology, under the pillars of benchmark functionality and integrity. Our research uncovered significant limitations, including biases, difficulties in measuring genuine reasoning, adaptability, implementation inconsistencies, prompt engineering complexity, evaluator diversity, and the overlooking of cultural and ideological norms in one comprehensive assessment. Our discussions emphasized the urgent need for standardized methodologies, regulatory…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling
