gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy

Tousif Islam; Digvijay Wadekar; Zihan Zhou

arXiv:2605.11269 · gr-qc · May 13, 2026

TL;DR

This paper evaluates whether advanced LLM coding agents can perform high-precision gravitational wave modeling tasks, revealing current limitations and systematic failures in complex scientific computations.

Contribution

It introduces gwBenchmarks, a suite of eight gravitational wave modeling tasks, and evaluates twelve LLM coding agents on them.

Findings

01. Agents often relied on proxy metrics or fabricated results.

02. No single agent consistently outperformed others across tasks.

03. All agents struggled with high-precision waveform modeling, showing systematic errors.

Abstract

Modern gravitational wave astronomy relies on modeling tasks that often require months of graduate-level effort, including building fast waveform surrogates from expensive numerical relativity simulations, modeling orbital dynamics of black holes, fitting merger remnant properties and constructing template banks. These problems demand extreme precision to support detection and parameter inference, with state-of-the-art models achieving $\lesssim 10^{-4}$ relative error. We study whether state-of-the-art LLM coding agents can perform such end-to-end scientific modeling, where success requires constructing models with stringent accuracy criteria and reasoning about physical systems. We introduce gwBenchmarks, a suite of eight tasks grounded in gravitational wave analytic calculations and numerical simulations collectively representing over $10^{8}$ core-hours of compute. The tasks span…
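
To make the $10^{-4}$ relative-error scale quoted in the abstract concrete, the sketch below checks a model against a reference using a generic relative L2 error. This is an illustration only: the excerpt does not state which error metric the benchmark uses, and the toy chirp signal, the `relative_l2_error` helper, and the `TOLERANCE` name are all assumptions introduced here.

```python
import math


def relative_l2_error(model, reference):
    """Return ||model - reference|| / ||reference|| for equal-length sequences."""
    if len(model) != len(reference):
        raise ValueError("sequences must have the same length")
    num = math.sqrt(sum(abs(m - r) ** 2 for m, r in zip(model, reference)))
    den = math.sqrt(sum(abs(r) ** 2 for r in reference))
    if den == 0.0:
        raise ValueError("reference has zero norm")
    return num / den


# Toy chirp-like signal standing in for a reference waveform (hypothetical data).
times = [i * 0.001 for i in range(1000)]
reference = [math.sin(2 * math.pi * (20 * t + 30 * t * t)) for t in times]

# Hypothetical model output: the reference with a small fractional amplitude bias.
model = [(1 + 5e-5) * r for r in reference]

# Assumed acceptance threshold, taken from the 1e-4 scale quoted in the abstract.
TOLERANCE = 1e-4

error = relative_l2_error(model, reference)
passes = error <= TOLERANCE
print(f"relative L2 error = {error:.2e}, passes = {passes}")
```

A pure amplitude bias of $5 \times 10^{-5}$ yields a relative L2 error of the same size, so this toy model passes. Gravitational wave work often uses a noise-weighted mismatch instead, which would need a detector noise curve that is outside the scope of this sketch.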
