SciDA: Scientific Dynamic Assessor of LLMs
Junting Zhou, Tingjia Miao, Yiyan Liao, Qichao Wang, Zhoufutu Wen, Yanqin Wang, Yunjie Huang, Ge Yan, Leqi Wang, Yucheng Xia, Hongwan Gao, Yuansong Zeng, Renjie Zheng, Chen Dun, Yitao Liang, Tong Yang, Wenhao Huang, Ge Zhang

TL;DR
SciDA is a multidisciplinary benchmark of over 1,000 Olympiad-level numerical problems that evaluates LLMs' reasoning skills while mitigating data contamination: the numerical values in each problem are randomly re-initialized for every inference round.
Contribution
The paper introduces SciDA, a benchmark that mitigates data contamination and aims to provide an unbiased assessment of LLMs' numerical reasoning abilities.
Findings
LLMs' performance drops significantly with randomized numerical initializations.
Existing benchmarks are susceptible to data contamination.
SciDA offers a more truthful evaluation of LLM reasoning capabilities.
Abstract
Advances in the reasoning capabilities of Large Language Models (LLMs) enable them to solve scientific problems more effectively. A high-quality benchmark for comprehensive and appropriate assessment is therefore important, yet existing benchmarks either risk data contamination or cover too few disciplines. Specifically, because the data sources of LLM training and static benchmarks overlap, the answer keys or numerical patterns of answers are inadvertently memorized (i.e., data contamination), leading to systematic overestimation of the models' reasoning capabilities, especially numerical reasoning. We propose SciDA, a multidisciplinary benchmark that consists exclusively of over 1k Olympiad-level numerical computation problems and allows randomized numerical initializations for each inference round to avoid reliance on fixed numerical patterns. We conduct a series of experiments with both…
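The randomized initialization described in the abstract can be pictured with a small sketch. This is not the SciDA code: the `ParametricProblem` class, the projectile example, the sampling ranges, and the 1% relative tolerance in `is_correct` are all hypothetical, chosen only to show how a problem with a closed-form solver can yield a fresh question and reference answer on every round.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class ParametricProblem:
    """Hypothetical container: a question template plus a reference solver."""
    template: str                            # question text with {name} placeholders
    ranges: Dict[str, Tuple[float, float]]   # sampling range for each variable
    solve: Callable[..., float]              # computes the answer from the sampled values

    def instantiate(self, rng: random.Random) -> Tuple[str, float]:
        # Draw new values for every variable, then recompute the reference answer
        # from those same values so question and answer always agree.
        values = {
            name: round(rng.uniform(lo, hi), 2)
            for name, (lo, hi) in self.ranges.items()
        }
        return self.template.format(**values), self.solve(**values)


def is_correct(predicted: float, reference: float, rel_tol: float = 1e-2) -> bool:
    # Assumed grading rule: accept answers within 1% relative error.
    return math.isclose(predicted, reference, rel_tol=rel_tol)


# Illustrative problem (not taken from the benchmark).
projectile = ParametricProblem(
    template=(
        "A ball is thrown vertically upward at {v0} m/s. "
        "Taking g = {g} m/s^2, what maximum height does it reach, in metres?"
    ),
    ranges={"v0": (5.0, 30.0), "g": (9.7, 9.9)},
    solve=lambda v0, g: v0 ** 2 / (2 * g),
)

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        question, answer = projectile.instantiate(rng)
        print(question, "->", round(answer, 3))
```

Because the reference answer is recomputed from the sampled values, a model that has memorized the answer to one fixed instance gains nothing on the next draw, which is the property the benchmark relies on.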
Peer Reviews
Decision·Submitted to ICLR 2026
- The data curation pipeline with Olympiad-level problems and expert annotation ensures quality and complexity.
- The empirical findings of systematic performance drop show the effect of randomized conditions and reveal concerns of data contamination.
- Code is provided for reproducibility.
- Some statements are a bit overclaimed.
- The claim of being "contamination-proof" is not fully substantiated. This is actually a very hard problem to completely solve.
- In the abstract, there is "we provide truthful and unbiased assessments of the numerical reasoning capabilities of LLMs", but there are actually no guarantees.
- Writing can be refined for conciseness and professional tone.
SciDA effectively prevents models from relying on fixed numerical patterns by randomly initializing the variables in each problem. Moreover, SciDA spans multiple disciplines, including mathematics, physics, chemistry, and biology, and all its problems are drawn from Olympiad-level competitions, ensuring high quality and complexity. This comprehensive evaluation approach provides researchers with a more realistic and holistic assessment tool for evaluating the scientific reasoning capabilities of
- The core approach of SciDA, randomly initializing variables within problems, effectively mitigates model reliance on fixed numerical patterns. While this strategy reduces the risk of data contamination to some extent, it remains relatively simplistic and lacks deeper analysis or evaluation of the model’s actual reasoning process.
- Although SciDA’s dynamic initialization strategy addresses data contamination to a degree, similar techniques have already been employed in other domains, such as d
1. The problem this paper addresses is a well-recognized and critical issue in the field of LLM evaluation.
2. The dynamic "functionalization" and random initialization of problems is a sound and necessary methodology for testing true generalization over memorized patterns.
3. The rigorous, expert-led data collection and annotation pipeline ensures a high-quality, difficult set of problems that yield valuable insights into model capabilities.
1. The dataset's scale (1,000 problems) is limited when distributed across four distinct scientific disciplines and multiple difficulty levels, which may affect the statistical reliability of results in specific sub-domains.
2. The paper's main conclusion is ambiguous. The performance drop is attributed to a failure in "genuine problem-solving", yet the paper's own error analysis (Figure 4) shows "Calculation Errors" are far more common than "Logical Errors" in most subjects. This suggests mo
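The reviewer's concern about statistical reliability in sub-domains can be made concrete with a back-of-the-envelope calculation. The sketch below uses the standard normal-approximation confidence interval for a proportion; the subset sizes (an even split of 1,000 problems into four disciplines, and a further 50-problem difficulty bucket) and the 50% accuracy are assumptions for illustration, not figures from the paper.

```python
import math


def margin_of_error(accuracy: float, n_problems: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% interval for an accuracy
    measured on n_problems independent problems."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_problems)


if __name__ == "__main__":
    # Hypothetical subset sizes: full benchmark, one of four equal disciplines,
    # and a small difficulty bucket inside one discipline.
    for n in (1000, 250, 50):
        half_width = margin_of_error(0.5, n)
        print(f"n={n:4d}: accuracy 50% +/- {100 * half_width:.1f} points")
```

Under these assumptions the uncertainty is roughly ±3 points on the full set, about ±6 points on a 250-problem discipline, and about ±14 points on a 50-problem bucket, so small differences between models inside one sub-domain may not be distinguishable from noise.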
Taxonomy
Topics: Research Data Management Practices
