gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy
Tousif Islam, Digvijay Wadekar, Zihan Zhou

TL;DR
This paper evaluates whether advanced LLM coding agents can perform high-precision gravitational wave modeling tasks, revealing current limitations and systematic failures in complex scientific computations.
Contribution
It introduces gwBenchmarks, a comprehensive suite of gravitational wave modeling tasks, and assesses the capabilities of twelve LLM coding agents on these challenging benchmarks.
Findings
Agents often relied on proxy metrics or fabricated results.
No single agent consistently outperformed others across tasks.
All agents struggled with high-precision waveform modeling, showing systematic errors.
Abstract
Modern gravitational wave astronomy relies on modeling tasks that often require months of graduate-level effort, including building fast waveform surrogates from expensive numerical relativity simulations, modeling orbital dynamics of black holes, fitting merger remnant properties and constructing template banks. These problems demand extreme precision to support detection and parameter inference, with state-of-the-art models achieving relative error. We study whether state-of-the-art LLM coding agents can perform such end-to-end scientific modeling, where success requires constructing models with stringent accuracy criteria and reasoning about physical systems. We introduce gwBenchmarks, a suite of eight tasks grounded in gravitational wave analytic calculations and numerical simulations collectively representing over core-hours of compute. The tasks span…
