Evaluating Hydro-Science and Engineering Knowledge of Large Language Models

Shiruo Hu; Wenbo Shan; Yingjia Li; Zhiqi Wan; Xinpeng Yu; Yunjia Qi; Haotian Xia; Yang Xiao; Dingxiao Liu; Jiaru Wang; Chenxu Gong; Ruixi Zhang; Shuyue Wu; Shibo Cui; Chee Hui Lai; Wei Luo; Yubin He; Bin Xu; Jianshi Zhao

arXiv:2512.03672·cs.CL·December 4, 2025

TL;DR

This paper introduces a comprehensive benchmark to evaluate large language models' knowledge and application abilities in Hydro-Science and Engineering, revealing their strengths and limitations across various subfields.

Contribution

The paper presents Hydro-SE Bench, a new evaluation benchmark of 4,000 multiple-choice questions covering nine subfields, to systematically assess LLMs in Hydro-SE.

Findings

01

Commercial LLMs achieve 0.74-0.80 accuracy.

02

Small LLMs achieve 0.41-0.68 accuracy.

03

Scaling improves reasoning and calculation abilities.
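
For readers unfamiliar with how accuracy figures like these are typically obtained on a multiple-choice benchmark, the following is a minimal sketch, assuming accuracy is the fraction of questions whose predicted option matches the answer key. The record keys ("subfield", "answer", "prediction"), the subfield names, and the toy data are hypothetical and are not taken from Hydro-SE Bench; the paper's actual scoring procedure may differ.

```python
from collections import defaultdict


def accuracy_by_subfield(records):
    """Return {subfield: accuracy} plus an "overall" entry.

    Each record is a dict with hypothetical keys:
    "subfield", "answer" (gold option), "prediction" (model option).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        # Compare option letters case-insensitively, ignoring stray whitespace.
        hit = r["prediction"].strip().upper() == r["answer"].strip().upper()
        for key in (r["subfield"], "overall"):
            total[key] += 1
            correct[key] += hit  # True counts as 1, False as 0
    return {k: correct[k] / total[k] for k in total}


# Toy records with placeholder subfield names, not benchmark data.
records = [
    {"subfield": "hydrology", "answer": "A", "prediction": "a"},
    {"subfield": "hydrology", "answer": "C", "prediction": "B"},
    {"subfield": "hydraulics", "answer": "D", "prediction": "D"},
    {"subfield": "hydraulics", "answer": "B", "prediction": "B"},
]
scores = accuracy_by_subfield(records)
```

Reporting a per-subfield breakdown alongside the overall score is one way to expose the strengths and limitations across subfields that the summary mentions.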

Abstract

Hydro-Science and Engineering (Hydro-SE) is a critical and irreplaceable domain that secures human water supply, generates clean hydropower energy, and mitigates flood and drought disasters. Featuring multiple engineering objectives, Hydro-SE is an inherently interdisciplinary domain that integrates scientific knowledge with engineering expertise. This integration necessitates extensive expert collaboration in decision-making, which poses difficulties for intelligence. With the rapid advancement of large language models (LLMs), their potential application in the Hydro-SE domain is being increasingly explored. However, the knowledge and application abilities of LLMs in Hydro-SE have not been sufficiently evaluated. To address this issue, we propose the Hydro-SE LLM evaluation benchmark (Hydro-SE Bench), which contains 4,000 multiple-choice questions. Hydro-SE Bench covers nine subfields…

Taxonomy

Topics: Topic Modeling · Advanced Graph Neural Networks · Multimodal Machine Learning Applications