A Survey on Large Language Model Benchmarks

Shiwen Ni; Guhong Chen; Shuaimin Li; Xuanang Chen; Siyi Li; Bingli Wang; Qiyao Wang; Xingjian Wang; Yifan Zhang; Liyang Fan; Chengming Li; Ruifeng Xu; Le Sun; Min Yang

arXiv:2508.15361 · cs.CL · August 22, 2025

TL;DR

This paper systematically reviews 283 large language model benchmarks, groups them into three categories (general capabilities, domain-specific, and target-specific), analyzes their strengths and weaknesses, and proposes guidelines for future benchmark development to improve evaluation fairness and reliability.

Contribution

It provides the first comprehensive categorization and analysis of large language model benchmarks, highlighting current issues and offering a design paradigm for future improvements.

Findings

01. Current benchmarks report inflated scores because of data contamination.

02. Biases in benchmarks lead to unfair evaluations across cultures and languages.

03. Evaluation of process credibility and of dynamic environments is lacking.

Abstract

In recent years, as the depth and breadth of large language models' capabilities have developed rapidly, a growing number of corresponding evaluation benchmarks have emerged. As quantitative tools for assessing model performance, benchmarks are not only a core means of measuring model capabilities but also a key element in guiding model development and promoting technological innovation. We present the first systematic review of the current status and development of large language model benchmarks, categorizing 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields like natural sciences, humanities and social sciences, and engineering technology;…

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Natural Language Processing Techniques