SciDA: Scientific Dynamic Assessor of LLMs

Junting Zhou; Tingjia Miao; Yiyan Liao; Qichao Wang; Zhoufutu Wen; Yanqin Wang; Yunjie Huang; Ge Yan; Leqi Wang; Yucheng Xia; Hongwan Gao; Yuansong Zeng; Renjie Zheng; Chen Dun; Yitao Liang; Tong Yang; Wenhao Huang; Ge Zhang

arXiv:2506.12909·cs.CL·June 17, 2025

Open Access·3 Reviews

TL;DR

SciDA is a multidisciplinary benchmark of over 1,000 Olympiad-level numerical problems. It re-randomizes the numerical values of each problem on every inference round, so that LLMs are evaluated on their reasoning rather than on answers memorized through data contamination.

Contribution

The paper introduces SciDA, a benchmark that mitigates data contamination and aims to provide an unbiased assessment of LLMs' numerical reasoning abilities.

Findings

01

LLMs' performance drops significantly with randomized numerical initializations.

02

Existing benchmarks are susceptible to data contamination.

03

SciDA offers a more truthful evaluation of LLM reasoning capabilities.

Abstract

Advances in the reasoning capabilities of Large Language Models (LLMs) enable them to solve scientific problems more effectively. A high-quality benchmark for comprehensive and appropriate assessment is therefore important, yet existing benchmarks either risk data contamination or cover too few disciplines. Specifically, because the data sources used for LLM training overlap with those of static benchmarks, answer keys or the numerical patterns of answers can be inadvertently memorized (i.e., data contamination), leading to systematic overestimation of reasoning capabilities, especially numerical reasoning. We propose SciDA, a multidisciplinary benchmark that consists exclusively of over 1k Olympiad-level numerical computation problems and allows randomized numerical initializations for each inference round to avoid reliance on fixed numerical patterns. We conduct a series of experiments with both…
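The randomized-initialization idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `ParamProblem` class, the `is_correct` helper, the free-fall toy problem, and the 1% relative tolerance are all hypothetical choices made for illustration. The sketch only shows the general pattern of pairing a problem template with a solver function, so that each inference round draws fresh numbers and the reference answer is recomputed to match.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class ParamProblem:
    """Hypothetical container: a text template plus a solver over its parameters."""
    template: str
    ranges: Dict[str, Tuple[float, float]]  # parameter name -> (low, high)
    solver: Callable[..., float]            # computes the reference answer

    def instantiate(self, rng: random.Random) -> Tuple[str, float]:
        # Draw a fresh value for every parameter, rounded so that the
        # problem text and the solver see exactly the same numbers.
        params = {
            name: round(rng.uniform(low, high), 2)
            for name, (low, high) in self.ranges.items()
        }
        return self.template.format(**params), self.solver(**params)


def is_correct(predicted: float, reference: float, rel_tol: float = 1e-2) -> bool:
    # Assumed grading rule: accept answers within a 1% relative tolerance.
    return math.isclose(predicted, reference, rel_tol=rel_tol)


# Toy problem (not taken from SciDA): free fall from rest, t = sqrt(2h / g).
problem = ParamProblem(
    template=(
        "A ball is dropped from rest at a height of {h} m. "
        "Taking g = 9.8 m/s^2, how many seconds does it take to reach the ground?"
    ),
    ranges={"h": (5.0, 50.0)},
    solver=lambda h: math.sqrt(2.0 * h / 9.8),
)

if __name__ == "__main__":
    # Each seed stands in for one inference round with new numbers.
    for seed in range(3):
        text, answer = problem.instantiate(random.Random(seed))
        print(text, "->", answer)
```

Because the reference answer is derived from the sampled parameters, a model that has memorized the answer to one fixed instance gains nothing on the next draw.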

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 6·Confidence 3

Strengths

- The data curation pipeline with Olympiad-level problems and expert annotation ensures quality and complexity.
- The empirical findings of systematic performance drop show the effect of randomized conditions and reveal concerns of data contamination.
- Code is provided for reproducibility.

Weaknesses

- Some statements are a bit overclaimed.
- The claim of being "contamination-proof" is not fully substantiated. This is actually a very hard problem to completely solve.
- In the abstract, there is "we provide truthful and unbiased assessments of the numerical reasoning capabilities of LLMs", but there are actually no guarantees.
- Writing can be refined for conciseness and professional tone.

Reviewer 02·Rating 2·Confidence 4

Strengths

SciDA effectively prevents models from relying on fixed numerical patterns by randomly initializing the variables in each problem. Moreover, SciDA spans multiple disciplines, including mathematics, physics, chemistry, and biology, and all its problems are drawn from Olympiad-level competitions, ensuring high quality and complexity. This comprehensive evaluation approach provides researchers with a more realistic and holistic assessment tool for evaluating the scientific reasoning capabilities of

Weaknesses

- The core approach of SciDA, randomly initializing variables within problems, effectively mitigates model reliance on fixed numerical patterns. While this strategy reduces the risk of data contamination to some extent, it remains relatively simplistic and lacks deeper analysis or evaluation of the model’s actual reasoning process.
- Although SciDA’s dynamic initialization strategy addresses data contamination to a degree, similar techniques have already been employed in other domains, such as d

Reviewer 03·Rating 4·Confidence 4

Strengths

1. The problem this paper addresses is a well-recognized and critical issue in the field of LLM evaluation.
2. The dynamic "functionalization" and random initialization of problems is a sound and necessary methodology for testing true generalization over memorized patterns.
3. The rigorous, expert-led data collection and annotation pipeline ensures a high-quality, difficult set of problems that yield valuable insights into model capabilities.

Weaknesses

1. The dataset's scale (1,000 problems) is limited when distributed across four distinct scientific disciplines and multiple difficulty levels, which may affect the statistical reliability of results in specific sub-domains.
2. The paper's main conclusion is ambiguous. The performance drop is attributed to a failure in "genuine problem-solving", yet the paper's own error analysis (Figure 4) shows "Calculation Errors" are far more common than "Logical Errors" in most subjects. This suggests mo
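The statistical-reliability concern in the first weakness can be made concrete with a short calculation. The sketch below uses the normal-approximation (Wald) interval for a binomial proportion. The function name `ci_half_width`, the equal split of 250 problems per discipline, and the 50% accuracy are assumptions chosen for illustration, not figures from the paper or the review.

```python
import math


def ci_half_width(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wald confidence interval for an accuracy measured on n problems.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


if __name__ == "__main__":
    # Hypothetical split: the full set versus one of four equal disciplines.
    for label, n in [("all problems", 1000), ("one discipline", 250)]:
        half = ci_half_width(0.5, n)
        print(f"{label}: n={n}, accuracy 0.50 +/- {half:.3f}")
```

Under these assumptions, cutting the sample to a quarter doubles the interval width, since the half-width scales as 1/sqrt(n). Any further split by difficulty level widens it again.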

Taxonomy

Topics: Research Data Management Practices