InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis
Oliver Bentham, Vivek Srikumar

TL;DR
InfiniteScienceGym is a procedurally generated benchmark for scientific reasoning that enables evaluation of language models on evidence-grounded reasoning and unanswerable questions without large static datasets.
Contribution
It introduces a deterministic generator of self-contained scientific repositories, paired with a verifiable QA task for model evaluation.
Findings
No model exceeds 45% accuracy on the benchmark.
Recognizing unanswerable questions is a major challenge.
Stronger models are distinguished by more effective tool use, not simply by consuming more tokens.
Abstract
Large language models are emerging as scientific assistants, but evaluating their ability to reason from empirical data remains challenging. Benchmarks derived from published studies and human annotations inherit publication bias, known-knowledge bias, label noise, and substantial storage requirements. We present InfiniteScienceGym, a procedurally generated benchmark of scientific repositories paired with a verifiable question-answering task. From a seed, the simulator deterministically generates a self-contained repository with realistic directory structure, files, and tabular data, and a privileged QA generator produces both answerable and unanswerable questions with exact ground truth. This makes it possible to evaluate evidence-grounded reasoning, abstention, and tool-mediated analysis in a controlled setting without distributing a large static corpus. InfiniteScienceGym complements…
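The seed-to-repository idea in the abstract can be illustrated with a minimal sketch. This is not the authors' simulator: the function names, the file path, the column names, and the scoring rule below are all hypothetical, chosen only to show how a seed can determine both the data and the exact ground truth, including a question that should be answered by abstaining.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Question:
    text: str
    answer: Optional[float]  # None marks an unanswerable question


def generate_repository(seed: int, n_rows: int = 5) -> Dict[str, List[dict]]:
    """Hypothetical generator: maps a seed to a tiny 'repository' of tables."""
    rng = random.Random(seed)  # local RNG, so output depends only on the seed
    rows = [
        {"sample_id": i, "temperature_c": round(rng.uniform(15.0, 30.0), 2)}
        for i in range(n_rows)
    ]
    return {"data/measurements.csv": rows}


def generate_questions(repo: Dict[str, List[dict]]) -> List[Question]:
    """Hypothetical privileged QA generator: it sees the data, so truth is exact."""
    rows = repo["data/measurements.csv"]
    max_temp = max(r["temperature_c"] for r in rows)
    return [
        Question("What is the maximum temperature_c in data/measurements.csv?", max_temp),
        # No 'pressure_kpa' column exists, so the correct response is to abstain.
        Question("What is the mean pressure_kpa in data/measurements.csv?", None),
    ]


def score(question: Question, prediction: Optional[float], tol: float = 1e-6) -> bool:
    """A prediction of None means the model abstained."""
    if question.answer is None:
        return prediction is None
    return prediction is not None and abs(prediction - question.answer) <= tol


repo = generate_repository(seed=7)
questions = generate_questions(repo)
```

Because the generator draws from a `random.Random` instance seeded locally, calling `generate_repository(7)` again reproduces the same tables, so only the seed needs to be shared rather than the data itself.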
