RealPDEBench: A Benchmark for Complex Physical Systems with Real-World Data

Peiyan Hu; Haodong Feng; Hongyuan Liu; Tongtong Yan; Wenhao Deng; Tianrun Gao; Rong Zheng; Haoren Zheng; Chenglei Yu; Chuanrui Wang; Kaiwen Li; Zhi-Ming Ma; Dezhi Zhou; Xingcai Lu; Dixia Fan; Tailin Wu

arXiv:2601.01829·cs.LG·February 10, 2026

TL;DR

RealPDEBench introduces a comprehensive benchmark integrating real-world measurements with simulations for complex physical systems, enabling better evaluation and development of scientific ML models to bridge the sim-to-real gap.

Contribution

This work presents the first benchmark combining real-world and simulated data for physical systems, including datasets, tasks, metrics, and baselines to advance scientific ML research.

Findings

01 Pretraining with simulated data improves accuracy and convergence.

02 Significant discrepancies exist between real-world and simulated data.

03 The benchmark facilitates comparison of models on real-world physical data.
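
The pretraining finding can be illustrated with a minimal toy sketch. It is not the benchmark's code, data, or models: the linear model, the fixed offset between `w_sim` and `w_real` standing in for the sim-to-real gap, the sample sizes, and the step budget are all assumptions chosen for illustration. It compares three training regimes (simulated data only, real data only, and pretraining on simulated data followed by finetuning on real data) on held-out "real" data.

```python
import numpy as np


def make_data(rng, n, weights, noise):
    """Toy stand-in for measurements: y = x @ weights + Gaussian noise."""
    x = rng.normal(size=(n, weights.size))
    y = x @ weights + noise * rng.normal(size=n)
    return x, y


def train(x, y, w_init, steps, lr=0.05):
    """Gradient descent on mean squared error for a linear model."""
    w = w_init.copy()
    for _ in range(steps):
        grad = 2.0 * x.T @ (x @ w - y) / len(y)
        w -= lr * grad
    return w


def mse(x, y, w):
    """Mean squared prediction error of weights w on (x, y)."""
    return float(np.mean((x @ w - y) ** 2))


rng = np.random.default_rng(0)
w_real = np.array([1.0, -2.0, 0.5, 3.0])  # hypothetical "true" system
w_sim = w_real + 0.3                      # assumed systematic simulator bias

x_sim, y_sim = make_data(rng, 2000, w_sim, noise=0.0)    # plentiful, biased
x_real, y_real = make_data(rng, 100, w_real, noise=0.1)  # scarce, noisy
x_test, y_test = make_data(rng, 500, w_real, noise=0.1)  # held-out real data

w0 = np.zeros_like(w_real)
budget = 5  # gradient steps allowed on real data

w_sim_only = train(x_sim, y_sim, w0, steps=200)
w_real_only = train(x_real, y_real, w0, steps=budget)
w_finetuned = train(x_real, y_real, w_sim_only, steps=budget)  # warm start

results = {
    "sim_only": mse(x_test, y_test, w_sim_only),
    "real_only": mse(x_test, y_test, w_real_only),
    "sim_pretrain_then_real_finetune": mse(x_test, y_test, w_finetuned),
}
for name, err in results.items():
    print(f"{name}: test MSE on real data = {err:.3f}")
```

In this toy setting the warm start from simulation begins much closer to the real system than a zero initialization, so the same small step budget on real data leaves a lower error. How far this carries over to the benchmark's actual systems and baselines is a question for the paper's experiments, not this sketch.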

Abstract

Predicting the evolution of complex physical systems remains a central problem in science and engineering. Despite rapid progress in scientific Machine Learning (ML) models, a critical bottleneck is the scarcity of real-world data, which is expensive to collect, resulting in most current models being trained and validated on simulated data. Beyond limiting the development and evaluation of scientific ML, this gap also hinders research into essential tasks such as sim-to-real transfer. We introduce RealPDEBench, the first benchmark for scientific ML that integrates real-world measurements with paired numerical simulations. RealPDEBench consists of five datasets, three tasks, eight metrics, and ten baselines. We first present five real-world measured datasets with paired simulated datasets across different complex physical systems. We further define three tasks, which allow comparisons between real-world and…

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The benchmark collects paired real and simulated trajectories for several nontrivial systems instead of yet another synthetic-only PDE suite. This is likely to be useful for the community.
- The split of training regimes (simulation only, real only, pretrain on simulation then finetune on real) is useful, and the pretraining result is consistent with what many of us have seen in practice.
- The code appears modular enough to add a new dataset or baseline without painful surgery, and using a si

Weaknesses

Major concerns:
- The documentation about the experiment is unacceptably thin in its current form. If the experimental data was created or modified from another source, you need to cite it. If someone else created the dataset for you, you need them to write documentation for it. The current documentation on experimental data generation (which, of course, can be included in the appendix) is simply unacceptable for publication, especially for a paper which is supposed to be about this very datase

Reviewer 02 · Rating 6 · Confidence 3

Strengths

This dataset addresses the important aspect of bridging the gap between the simulated (numerical) and the physical system. While it is not possible to cover a large experimental setup, the authors provide $5$ tasks with both numerical and physical data. This work will therefore help in evaluating new models, even if it may not cover all possible scenarios.

Weaknesses

It is hard to say, but it is not possible to cover all possible physical experimental conditions. Nevertheless, the paper is a good contribution in the right direction. The main point is that the numerical generation scripts are missing, so it is not possible to extend the data (at least the numerical data) to other scenarios. On the experimental side, I am not able to judge whether the information is sufficient. I found the task description disconnected from the actual experiments. I would encourage the

Reviewer 03 · Rating 10 · Confidence 4

Strengths

The paper is original and tackles the difficult challenge of pairing real and simulated data. This is a clear gap in the literature, where simulated data was the only solution before. This provides a unique insight into many of the claims made on various architectures aimed at training surrogates for PDEs. It is comprehensive, well written, and thoughtfully formulated (especially the figures). It appears to achieve the maximum possible level of reproducibility through the use of the anonymous GitHub

Weaknesses

The only weakness is the obvious one: out-of-domain regimes are not covered. However, this is probably the biggest weakness of this area of study as a whole. The complexity of fluid dynamics makes that a separate challenge entirely (one I don't see being solved any time soon). Surrogates typically cover some precise range of Reynolds numbers around a specific geometry. It is the nature of this domain.

Code & Models

Datasets

AI4Science-WestlakeU/RealPDEBench
dataset· 4.8k dl

Taxonomy

Topics: Machine Learning in Materials Science · Generative Adversarial Networks and Image Synthesis · Gaussian Processes and Bayesian Inference