HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts in Hydrologic Model Calibration via Simulator-Grounded RL

Zhi Li; Songkun Yan; Jie Cao; Mofan Zhang; Anjiang Wei; Jinwoong Yoo; Yang Hong

arXiv:2605.17792 · cs.LG · May 19, 2026

TL;DR

This paper evaluates how well frontier large language model (LLM) agents calibrate a distributed hydrologic model and introduces HydroAgent, a domain-specific reinforcement learning (RL) approach that uses simulator feedback to improve calibration accuracy.

Contribution

The paper benchmarks nine frontier LLM agents on hydrologic model calibration and proposes HydroAgent, an agent fine-tuned with RL that outperforms general-purpose models on this domain-specific task.

Findings

01. Best LLM achieves NSE of 0.75, close to human experts

02. Domain-grounded RL improves calibration efficiency and accuracy

03. Scaling models alone does not close the performance gap

Abstract

Calibrating distributed hydrologic models is a critical bottleneck across operational water resources management: streamflow prediction, reservoir operation, drought monitoring, infrastructure design, and flood forecasting all depend on it. Each basin demands an expert to translate hydrograph signatures into adjustments of a high-dimensional parameter vector, and the resulting workflow does not transfer between watersheds. We ask: can frontier large language model (LLM) agents replace the human hydrologic modeler, and if not, what would it take? We benchmark nine frontier LLM agents (Claude Opus 4.6/4.7, Sonnet 4.6, GPT-5/5.4/5.4-pro, and Gemini 2.5-pro/3.1-pro/3-flash) on the operational CREST distributed hydrologic model used by the U.S. National Weather Service for flash-flood forecasting. Best-of-twenty-rounds Nash-Sutcliffe Efficiency (NSE) across four held-out gauges spanning…
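
For readers unfamiliar with the metric, the following is a minimal sketch of the standard Nash-Sutcliffe Efficiency and of a "best-of-rounds" score. It is not the paper's evaluation code: the function names `nse` and `best_of_rounds` are hypothetical, and the excerpt above does not say how scores are aggregated across gauges, so the sketch scores a single gauge only. NSE equals 1 for a perfect simulation and 0 when the simulation is no better than predicting the mean of the observations.

```python
from typing import Sequence


def nse(observed: Sequence[float], simulated: Sequence[float]) -> float:
    """Standard Nash-Sutcliffe Efficiency for one gauge.

    NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)
    """
    if len(observed) == 0 or len(observed) != len(simulated):
        raise ValueError("observed and simulated must be non-empty and equal in length")
    mean_obs = sum(observed) / len(observed)
    # Squared error of the simulation against the observations.
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    # Spread of the observations about their own mean.
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("NSE is undefined when the observations are constant")
    return 1.0 - squared_error / spread


def best_of_rounds(observed: Sequence[float],
                   rounds: Sequence[Sequence[float]]) -> float:
    """Highest NSE over a set of calibration rounds (hypothetical helper).

    Each entry of `rounds` is the simulated streamflow produced by one
    calibration attempt; the paper reports the best of twenty such rounds.
    """
    return max(nse(observed, simulated) for simulated in rounds)
```

Taking the maximum over rounds rewards an agent for finding one good parameter set within its budget, which is why the number of rounds matters when comparing scores.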