Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints
Xinge Liu, Terry Jingchen Zhang, Bernhard Schölkopf, Zhijing Jin, Kristen Menou

TL;DR
Stargazer is a scalable, physics-grounded benchmark environment for evaluating AI agents on complex model-fitting tasks using radial-velocity data, highlighting current limitations in physical parameter recovery.
Contribution
We introduce Stargazer, a novel environment with diverse astrophysical tasks for assessing AI agents' ability to fit models under physical constraints.
Findings
Agents often fit data well statistically but fail to recover correct physical parameters.
Increasing compute yields only marginal improvements, often because agents get stuck in recursive failure loops.
The environment can guide development of better AI strategies for scientific model fitting.
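To illustrate how a good statistical fit can coexist with wrong physical parameters, here is a minimal sketch that is not taken from the paper. It assumes a single planet on a circular orbit, noiseless data, and one observation per night. All names and numbers (`rv_circular`, `chi_square`, the 3.5-day period, the 10 m/s semi-amplitude) are hypothetical. Under that sampling, the true period and its one-day alias predict identical velocities at every observation time, so chi-square alone cannot tell them apart.

```python
import math


def rv_circular(t, period, semi_amplitude, phase):
    """Radial velocity (m/s) at time t (days) for one planet on a circular orbit."""
    return semi_amplitude * math.sin(2.0 * math.pi * t / period + phase)


def chi_square(times, observed, sigma, period, semi_amplitude, phase):
    """Sum of squared, noise-normalised residuals between data and model."""
    return sum(
        ((v - rv_circular(t, period, semi_amplitude, phase)) / sigma) ** 2
        for t, v in zip(times, observed)
    )


# Hypothetical system parameters, chosen only for illustration.
TRUE_PERIOD = 3.5   # days
TRUE_K = 10.0       # m/s semi-amplitude
TRUE_PHASE = 0.3    # radians
SIGMA = 1.0         # assumed per-point uncertainty, m/s

# Assumption: exactly one observation per night, and no noise.
times = [float(n) for n in range(30)]
observed = [rv_circular(t, TRUE_PERIOD, TRUE_K, TRUE_PHASE) for t in times]

# One-day alias: frequency 1 - 1/P, with the phase mirrored about pi/2.
alias_period = 1.0 / (1.0 - 1.0 / TRUE_PERIOD)
alias_phase = math.pi - TRUE_PHASE

chi2_true = chi_square(times, observed, SIGMA, TRUE_PERIOD, TRUE_K, TRUE_PHASE)
chi2_alias = chi_square(times, observed, SIGMA, alias_period, TRUE_K, alias_phase)
period_error = abs(alias_period - TRUE_PERIOD) / TRUE_PERIOD

print(f"chi2 at true period:  {chi2_true:.3e}")
print(f"chi2 at alias period: {chi2_alias:.3e}")
print(f"alias period: {alias_period:.3f} d, relative error {period_error:.0%}")
```

Both chi-square values are zero to within floating-point error, yet the alias period is wrong by more than half. A benchmark that scores only goodness of fit would accept either solution, whereas one that scores parameter recovery would not.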
Abstract
The rise of autonomous AI agents suggests that dynamic benchmark environments with built-in feedback on scientifically grounded tasks are needed to evaluate the capabilities of these agents in research work. We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative physics-grounded model-fitting tasks using inference on radial-velocity (RV) time series data. Stargazer comprises 120 tasks across three difficulty tiers, including 20 real archival cases, covering diverse scenarios ranging from high-SNR single-planet systems to complex multi-planetary configurations requiring involved low-SNR analysis. Our evaluation of eight frontier agents reveals a gap between numerical optimization and adherence to physical constraints: although agents often achieve a good statistical fit, they frequently fail to recover correct physical system parameters, a…
