TL;DR
This paper introduces GAOKAO-Eval, a new benchmark based on China's Gaokao exam. It shows that high LLM scores do not necessarily indicate human-like understanding or capabilities, which challenges current evaluation methods.
Contribution
The paper develops GAOKAO-Eval, applies the Rasch model to analyze LLM scoring patterns, and uncovers key discrepancies indicating limitations of current benchmark evaluations.
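For readers unfamiliar with the Rasch model, the sketch below shows its standard one-parameter logistic form: the probability of a correct answer depends only on the gap between the test-taker's ability and the question's difficulty. The function name and the numeric values are illustrative assumptions; this summary does not describe how the paper actually fits the model to LLM scores.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the standard Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical values chosen only for illustration, not taken from the paper.
ability = 1.0
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]

# For a fixed ability, the predicted success rate falls as difficulty rises.
probabilities = [rasch_probability(ability, b) for b in difficulties]

for b, p in zip(difficulties, probabilities):
    print(f"difficulty={b:+.1f}  P(correct)={p:.3f}")
```

Under this model, a test-taker of fixed ability should succeed less often as questions get harder. That monotonic pattern is the baseline against which scoring behavior can be compared.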
Findings
High scores do not reliably reflect human-aligned capabilities.
LLMs show inconsistent performance across question difficulties.
Grading of LLM answers is often inconsistent among teachers.
Abstract
Large Language Models (LLMs) are commonly evaluated using human-crafted benchmarks, under the premise that higher scores implicitly reflect stronger human-like performance. However, there is growing concern that LLMs may "game" these benchmarks due to data leakage, achieving high scores while struggling with tasks simple for humans. To substantively address the problem, we create GAOKAO-Eval, a comprehensive benchmark based on China's National College Entrance Examination (Gaokao), and conduct "closed-book" evaluations for representative models released prior to Gaokao. Contrary to prevailing consensus, even after addressing data leakage and comprehensiveness, GAOKAO-Eval reveals that high scores still fail to truly reflect human-aligned capabilities. To better understand this mismatch, we introduce the Rasch model from cognitive psychology to analyze LLM scoring patterns and identify…
Peer Reviews
Decision·Submitted to ICLR 2025
• The proposed benchmark highlights the data leakage issues of previous benchmarks. The annual update of Gaokao helps evaluate LLM performance without tedious manual data collection.
• The paper evaluates a few popular LLMs on the proposed benchmark.
• The paper finds a performance mismatch between humans and LLMs on Gaokao tasks.
• The paper lacks clarity:
  o How were the human results obtained? What are the grading guidelines? How were the tasks distributed? How was the human evaluation process validated?
  o The paper uses the Rasch model to simulate human performance. However, it does not clarify why Gaokao performance can be simulated by the Rasch model. The actual human performance distribution might be similar to the LLMs'.
  o Line 274 mentions the difficulty of questions. How is exactly the hybrid approach with human anno
• Introduces a comprehensive evaluation benchmark using Gaokao exams that updates every year with minimal/no data leakage.
• Explores scoring consistency and variance with respect to question difficulty.
• Attempts to model scoring behavior using cognitive psychology (Rasch model).
• The Rasch model is commonly used in human testing. But it is unclear if the Rasch model is the best fit for modeling LLM behavior, especially without fully exploring/discussing alternative psychometric models.
• Some descriptions seem exaggerated. GAOKAO-Eval primarily assesses knowledge-based aspects of LLM performance, focusing on subject knowledge and question-answering within a constrained exam format. This scope limits its comprehensiveness as a benchmark for LLM capabilities, which is i
Understanding the capabilities of the LLMs is a very relevant and timely topic. I appreciate the author’s effort to curate such a valuable dataset that aims to test various abilities of the models.
I think the paper can be significantly improved and revised to clearly articulate the experiments, results, and insights.
1. The paper’s general message that LLMs’ performance varies across similar question types and that there is anomalous consistency across difficulty levels is well-studied in the literature. It would be beneficial if the authors focus on their dataset to showcase how models perform across different subjects and difficulty levels, highlighting what types of problems they per
