A Survey on Large Language Model Benchmarks
Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, Chengming Li, Ruifeng Xu, Le Sun, Min Yang

TL;DR
This paper systematically reviews 283 large language model benchmarks, categorizing them into three types, analyzing their strengths and weaknesses, and proposing guidelines for future benchmark development to improve evaluation fairness and reliability.
Contribution
It provides the first comprehensive categorization and analysis of large language model benchmarks, highlighting current issues and offering a design paradigm for future improvements.
Findings
Current benchmarks face issues like inflated scores due to data contamination.
Biases in benchmarks lead to unfair evaluations across cultures and languages.
Evaluation of process credibility and dynamic environments is lacking.
Abstract
In recent years, as the depth and breadth of large language models' capabilities have developed rapidly, a growing number of corresponding evaluation benchmarks have emerged. As a quantitative assessment tool for model performance, benchmarks are not only a core means of measuring model capabilities but also a key element in guiding the direction of model development and promoting technological innovation. We present the first systematic review of the current status and development of large language model benchmarks, categorizing 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields like natural sciences, humanities and social sciences, and engineering technology;…
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Natural Language Processing Techniques
