The Ever-Evolving Science Exam
Junying Wang, Zicheng Zhang, Yijin Guo, Farong Wen, Ye Shen, Yingji Liang, Yalun Wu, Wenzhe Li, Chunyi Li, Zijian Chen, Qi Jia, Guangtao Zhai

TL;DR
The paper introduces EESE, a dynamic, scalable science benchmark with over 100K questions across multiple disciplines, designed to reliably evaluate foundation models' scientific understanding while minimizing data leakage and evaluation costs.
Contribution
EESE is a novel, periodically updated science benchmark with a large, expertly curated question pool and a leakage-resilient subset for efficient evaluation of foundation models.
Findings
EESE effectively differentiates model strengths and weaknesses in scientific understanding.
EESE's design minimizes data leakage and evaluation overhead.
Experiments show EESE's robustness and scalability across diverse models.
Abstract
As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad Range, wide Reach, and high Rigor, yet they often face two major challenges: data leakage risks that compromise benchmarking validity, and evaluation inefficiency due to large-scale testing. To address these issues, we introduce the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public EESE-Pool with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring Range, Reach, and Rigor, 2) a periodically updated 500-instance subset EESE, sampled and validated to…
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
The paper correctly identifies a critical and widely recognized challenge in LLM evaluation: data leakage and benchmark contamination, which can invalidate results. The stated goal of creating a large-scale, high-quality ("Rigor") benchmark covering a wide range of scientific disciplines (500+ subfields) is ambitious and, if executed transparently, would be valuable to the community.
1. The paper is fatally flawed by an irreconcilable contradiction. It states its 100K-item EESE-Pool is "non-public" (Abstract, Sec 1), yet its Reproducibility Statement (Sec 7) explicitly "guarantee[s] that all relevant... datasets will be made publicly available". This core incoherence makes the paper's premise impossible to evaluate. 2. The paper's core proposal of a non-public dataset is useless to the purpose of a scientific benchmark, which is to provide a standardized, verifiable, and p
1. Ambitious and large-scale effort to build a benchmark addressing key issues of leakage and scalability. 2. Strong organization with clear design principles and methodology. 3. Inclusion of both natural and social science disciplines broadens scope beyond typical benchmarks. 4. Demonstrates clear model differentiation and human-model performance gaps, suggesting benchmark difficulty is appropriate. 5. Introduces a refinement process that systematically increases question complexity.
1. The dynamic “ever-evolving” aspect is underspecified. It is unclear how sustainable or reproducible the periodic updates are in practice. 2. Heavy reliance on proprietary models for evaluation and quality control (e.g., “thinking models”) makes reproducibility questionable. This also raises a question about how users can evaluate it without access to such models. 3. The use of LLMs to label and refine data raises circularity concerns and potential bias—models are used to create and test th
* The paper is easy to read and understand; it is well written and well structured. * Comprehensive dataset of 100K question-answer pairs with a nice evaluation set of 500 samples (stratified sampling across difficulty levels) that will be resampled from time to time. * Rigorous process for quality checking and human involvement. * Large-scale evaluation of 32 different LMMs. * The results are clear and the discussion is good.
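The stratified sampling across difficulty levels mentioned in this review can be illustrated with a minimal sketch. This is not the authors' procedure, which this summary does not describe; the field name `difficulty`, the three level names, the toy pool, and the helper `stratified_sample` are all assumptions made for illustration. The sketch allocates draws to each difficulty level in proportion to its share of the pool.

```python
import random
from collections import defaultdict


def stratified_sample(pool, key, n, seed=0):
    """Draw n items, giving each stratum a share proportional to its size."""
    rng = random.Random(seed)

    # Group items by stratum (here: a hypothetical difficulty label).
    strata = defaultdict(list)
    for item in pool:
        strata[key(item)].append(item)

    # Largest-remainder allocation so the per-stratum quotas sum to exactly n.
    total = len(pool)
    exact = {k: n * len(v) / total for k, v in strata.items()}
    quota = {k: int(e) for k, e in exact.items()}
    leftover = n - sum(quota.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - quota[k], reverse=True)
    for k in by_remainder[:leftover]:
        quota[k] += 1

    # Sample without replacement inside each stratum.
    sample = []
    for k in sorted(strata):
        sample.extend(rng.sample(strata[k], quota[k]))
    return sample


# Toy pool with made-up difficulty labels (not data from the paper).
labels = ["easy"] * 500 + ["medium"] * 300 + ["hard"] * 200
pool = [{"id": i, "difficulty": d} for i, d in enumerate(labels)]
subset = stratified_sample(pool, key=lambda q: q["difficulty"], n=50)
```

Changing `seed` gives a different draw with the same per-level proportions, which is one simple way a periodic refresh could be emulated under these assumptions.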
* The paper leaves a lot of detail out, so my understanding of it is only partial. * The description of the dataset and what it measures is at times unclear. I disagree that the benchmark assesses scientific capabilities, as is stated in the abstract. My understanding is that this benchmark is a test of knowledge about science and not scientific capabilities. Similar examples of imprecise language about what the benchmark tests exist elsewhere in the text as well. * I am not really a
**Usefulness**: A periodically refreshed 500-item suite can lower evaluation cost while still surfacing rank differences among models. The reported rank correlations between the 500-item subset and the full pool suggest the subset is a reasonable proxy for model ordering. This can help labs iterate faster on science capabilities while avoiding full-scale runs. **Human-curated difficulty**: The multi-stage refinement, including expert-driven edits and targeted distractor design, plausibly incre
**Comparability over time**: Because each release is a fresh resample, it is hard to compare new model results to older baselines or to chart model progress across time stamps. Without anchors or versioned fixed sets, changes in scores may reflect sampling noise as much as model improvement. This weakens the utility for longitudinal tracking and for comparing to external baselines that were reported on a different draw. **Novelty**: There are many science benchmarks and pools that use transcrip
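The rank correlation between subset scores and full-pool scores that this review refers to can be computed as in the sketch below. It uses Spearman's rank correlation, which is an assumption: the summary does not say which coefficient the paper reports. The score lists are made-up numbers for five hypothetical models, not results from the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Illustrative accuracies for five hypothetical models (invented values).
pool_scores = [0.62, 0.55, 0.48, 0.40, 0.31]
subset_scores = [0.60, 0.57, 0.44, 0.45, 0.30]
rho = spearman(pool_scores, subset_scores)
```

In this toy case two adjacent models swap places on the subset, so `rho` falls below 1. Repeating the calculation over several independent draws would give a rough sense of how much of a score change could be sampling noise, which is the reviewer's concern about comparability over time.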
