MaD Physics: Evaluating information seeking under constraints in physical environments
Moksh Jain, Mehdi Bennani, Johannes Bausch, Yuri Chervonyi, Bogdan Georgiev, Simon Osindero, Nenad Tomašev

TL;DR
MaD Physics is a new benchmark designed to evaluate agents' ability to make informative measurements and infer physical laws under resource constraints, addressing limitations of existing scientific discovery benchmarks.
Contribution
The paper introduces MaD Physics, a benchmark of constrained measurement tasks in environments with altered physical laws, designed to assess how well models infer the underlying physical model and plan measurements under constraints.
Findings
The benchmark reveals shortcomings in current models' exploration strategies.
Agents struggle with constrained measurement planning and physical law inference.
MaD Physics highlights areas for improving scientific reasoning in AI models.
Abstract
Scientific discovery is fundamentally a resource-constrained process that requires navigating complex trade-offs between the quality and quantity of measurements due to physical and cost constraints. Measurements drive the scientific process by revealing novel phenomena to improve our understanding. Existing benchmarks for evaluating agents for scientific discovery focus on either static knowledge-based reasoning or unconstrained experimental design tasks, and do not capture the ability to make measurements and plan under constraints. To bridge this gap, we propose Measuring and Discovering Physics (MaD Physics), a benchmark to evaluate the ability of agents to make informative measurements and conclusions subject to constraints on the quality and quantity of measurements. The benchmark consists of three environments, each based on a distinct physical law. To mitigate contamination from…
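To make the setting concrete, here is a minimal, hypothetical sketch of the kind of task the abstract describes. The toy environment hides an altered inverse-power law F = k / r^p (p = 2.5 here, rather than Newton's 2). Every measurement draws on a fixed budget, and buying more precision per measurement leaves fewer measurements. `ToyConstrainedEnv`, `measure`, `fit_power_law`, `run_demo`, and the cost and noise models are all illustrative assumptions. They are not the benchmark's actual environments or API.

```python
import math
import random


class ToyConstrainedEnv:
    """Hypothetical toy environment (NOT the MaD Physics API).

    Hidden law: F(r) = k / r**p with an altered exponent p. Each measurement
    costs `precision` units of budget; higher precision means lower noise.
    """

    def __init__(self, k=2.0, p=2.5, budget=10.0, base_noise=0.05, seed=0):
        self._k, self._p = k, p          # hidden parameters the agent must infer
        self.budget = budget             # total budget (quantity constraint)
        self.base_noise = base_noise     # log-scale noise at precision 1 (quality)
        self._rng = random.Random(seed)  # seeded for reproducibility

    def measure(self, r, precision):
        if r <= 0 or precision <= 0:
            raise ValueError("r and precision must be positive")
        if precision > self.budget:
            raise RuntimeError("measurement budget exhausted")
        self.budget -= precision
        sigma = self.base_noise / precision      # more precision -> less noise
        true_force = self._k / r ** self._p
        # Multiplicative (log-normal) noise keeps measured forces positive.
        return true_force * math.exp(self._rng.gauss(0.0, sigma))


def fit_power_law(rs, fs):
    """Least-squares fit of log F = log k - p log r; returns (k_hat, p_hat)."""
    xs = [math.log(r) for r in rs]
    ys = [math.log(f) for f in fs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.exp(my - slope * mx), -slope


def run_demo(n_measurements=5, seed=0):
    """Split the budget evenly over log-spaced distances, then infer the law."""
    env = ToyConstrainedEnv(seed=seed)
    precision = env.budget / n_measurements          # quality/quantity trade-off
    rs = [2.0 ** i for i in range(n_measurements)]   # 1, 2, 4, 8, ...
    fs = [env.measure(r, precision) for r in rs]
    k_hat, p_hat = fit_power_law(rs, fs)
    return k_hat, p_hat, env.budget


if __name__ == "__main__":
    k_hat, p_hat, remaining = run_demo()
    print(f"k_hat={k_hat:.3f}  p_hat={p_hat:.3f}  budget left={remaining}")
```

With the defaults, `run_demo()` spends the whole budget on five measurements at precision 2 and should recover an exponent near 2.5. Calling `run_demo(n_measurements=10)` instead trades per-measurement quality for quantity. Choosing between such allocations is the sort of constrained planning the abstract says existing benchmarks do not capture.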
