InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

Oliver Bentham; Vivek Srikumar

arXiv:2604.13201·cs.CL·April 16, 2026

TL;DR

InfiniteScienceGym is a procedurally generated benchmark for scientific reasoning that enables evaluation of language models on evidence-grounded reasoning and unanswerable questions without large static datasets.

Contribution

It introduces a novel, deterministic, self-contained scientific repository generator paired with a verifiable QA task for comprehensive model evaluation.

Findings

01

No model exceeds 45% accuracy on the benchmark.

02

Recognizing unanswerable questions is a major challenge.

03

Stronger models use tools more effectively rather than simply consuming more tokens.

Abstract

Large language models are emerging as scientific assistants, but evaluating their ability to reason from empirical data remains challenging. Benchmarks derived from published studies and human annotations inherit publication bias, known-knowledge bias, label noise, and substantial storage requirements. We present InfiniteScienceGym, a procedurally generated benchmark of scientific repositories paired with a verifiable question-answering task. From a seed, the simulator deterministically generates a self-contained repository with realistic directory structure, files, and tabular data, and a privileged QA generator produces both answerable and unanswerable questions with exact ground truth. This makes it possible to evaluate evidence-grounded reasoning, abstention, and tool-mediated analysis in a controlled setting without distributing a large static corpus. InfiniteScienceGym complements…
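The seed-to-repository idea described in the abstract can be illustrated with a toy sketch. This is not the authors' simulator: the function names (generate_repository, generate_questions, score), the file path, and the column names are all hypothetical, chosen only to show how a seeded generator can yield a reproducible table plus answerable and unanswerable questions with exact ground truth.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Question:
    text: str
    answer: Optional[float]  # None marks an unanswerable question


def generate_repository(seed: int) -> Dict[str, List[Dict[str, float]]]:
    """Hypothetical toy generator: the same seed always yields the same table."""
    rng = random.Random(seed)  # local RNG, so no global state is involved
    n_rows = rng.randint(3, 6)
    rows = [
        {
            "temperature": round(rng.uniform(10.0, 30.0), 2),
            "yield": round(rng.uniform(0.0, 1.0), 3),
        }
        for _ in range(n_rows)
    ]
    # Assumed path; a real repository would contain many files and directories.
    return {"data/experiment.csv": rows}


def generate_questions(repo: Dict[str, List[Dict[str, float]]]) -> List[Question]:
    """Privileged QA step: it sees the generated data, so ground truth is exact."""
    rows = repo["data/experiment.csv"]
    mean_temp = sum(r["temperature"] for r in rows) / len(rows)
    return [
        # Answerable: the column exists in the table.
        Question("What is the mean temperature in data/experiment.csv?", mean_temp),
        # Unanswerable: no "pressure" column was ever generated.
        Question("What is the mean pressure in data/experiment.csv?", None),
    ]


def score(prediction: Optional[float], question: Question, tol: float = 1e-6) -> bool:
    """A prediction of None means the model abstained."""
    if question.answer is None:
        return prediction is None
    return prediction is not None and abs(prediction - question.answer) <= tol
```

Because the generator depends only on the seed, an evaluator in this sketch would need to share just the seed and the generator code rather than a stored corpus, which mirrors the motivation stated in the abstract.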
