Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs
Jost Arndt, Utku Isil, Michael Detzel, Wojciech Samek, Jackie Ma

TL;DR
This paper introduces synthetic PDE-based datasets for spatio-temporal graph machine learning, demonstrating their use in modeling disasters and improving epidemiological predictions through benchmarking and pre-training.
Contribution
It provides a novel methodology for creating customizable PDE-based datasets for spatio-temporal graph learning, addressing data scarcity and enabling benchmarking.
Findings
Synthetic datasets support modeling of disasters and hazards.
Pre-training on synthetic data improves real-world epidemiological model performance.
Benchmarking shows effectiveness of various machine learning models on the datasets.
Abstract
Many physical processes can be expressed through partial differential equations (PDEs). Real-world measurements of such processes are often collected at irregularly distributed points in space, which can be effectively represented as graphs; however, there are currently only a few existing datasets. Our work aims to make advancements in the field of PDE-modeling accessible to the temporal graph machine learning community, while addressing the data scarcity problem, by creating and utilizing datasets based on PDEs. In this work, we create and use synthetic datasets based on PDEs to support spatio-temporal graph modeling in machine learning for different applications. More precisely, we showcase three equations to model different types of disasters and hazards in the fields of epidemiology, atmospheric particles, and tsunami waves. Further, we show how such created datasets can be used by…
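To make the idea of a PDE-based dataset on irregularly placed points concrete, here is a minimal sketch. It is not the authors' pipeline: the abstract does not describe their solver, so this example swaps in a plain diffusion equation discretized with a graph Laplacian and explicit Euler time stepping. The function names (`build_knn_graph`, `simulate_graph_diffusion`), the k-nearest-neighbour graph construction, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def build_knn_graph(points, k=4):
    """Symmetric k-nearest-neighbour adjacency matrix (hypothetical helper)."""
    n = len(points)
    # Pairwise Euclidean distances between all sample points.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude self-loops
    neighbours = np.argsort(dists, axis=1)[:, :k]
    adj = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    adj[rows, neighbours.ravel()] = 1.0
    return np.maximum(adj, adj.T)  # make the graph undirected


def simulate_graph_diffusion(adj, u0, diffusivity=0.1, dt=0.05, steps=50):
    """Explicit Euler for du/dt = -diffusivity * L u, with L = D - A."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    states = [u0]
    u = u0
    for _ in range(steps):
        u = u - dt * diffusivity * (laplacian @ u)
        states.append(u)
    return np.stack(states)  # shape: (steps + 1, n_nodes)


# Assumed setup: 200 irregularly distributed measurement sites in the unit square.
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(200, 2))
adj = build_knn_graph(points, k=4)

# Initial condition: a Gaussian bump centred in the domain.
u0 = np.exp(-np.sum((points - 0.5) ** 2, axis=1) / 0.02)
trajectory = simulate_graph_diffusion(adj, u0)
```

Each row of `trajectory` is one graph snapshot, so consecutive rows can serve as input and target pairs for a forecasting model. Changing the initial condition, the diffusivity, or the point layout yields a new dataset, which is the kind of customizability the summary above refers to.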
Peer Reviews
Decision: Submitted to ICLR 2025
- High potential significance by creating synthetic data benchmarks in an area where they are lacking. I could certainly envision these datasets being used in future spatio-temporal graph learning papers.
- Very interesting experiment on real epidemiological data demonstrates the potential of the synthetically generated data to translate to prediction tasks on real data. This is much stronger evidence for the utility of the synthetic data than I typically see in this type of paper.
- Detailed co

- Some missing details on the real data experiment; see question 1 below. A more detailed description in the supplementary material would be useful.
- Sizes of the datasets seem to be fixed to somewhat small spatio-temporal graphs with a few hundred nodes and a few thousand edges, potentially limiting the scope.
- No results on the advection-diffusion and wave equation data in the body of the paper. Given the interests of the ICLR audience, I believe that the paper would be strengthened if there
Contribution: Because spatio-temporal graph data is scarce, this paper presents an alternative. With high-quality synthetic datasets, machine learning studies can be improved compared to theoretical analysis.
Presentation: The three example PDEs come from reliable sources and help demonstrate how to generate synthetic datasets.
1. This paper boldly claimed that real-world datasets have limitations in quality due to high noise. I personally believe that clean synthetic datasets are better for exploration and preliminary studies. In contrast, although the real-world dataset has high noise, it is necessary before the application is implemented. So, there is a trade-off between cleanness and reality.
2. Other ways exist to generate synthetic datasets, such as quasi-Monte Carlo simulation. This paper fails to compare with
- The authors identified a gap in the literature: high-quality temporal graph datasets are not abundant.
### `(W1) Benchmarking`
The authors use the following models:
- Repetition (naive)
- RNN (classic)
- TST (Transformer)
- MP-PDE (modified baseline)
- RNN-GNN-Fusion (source of this model is not clear)
- GraphEncoding (modified baseline)

They report the performance of these models on the synthetic datasets in Table 1 (Forecasting column), where the model `GraphEncoding` performs worse than the naive baseline `Repetition`. This is an odd observation, and better baselines should've been used for b
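The reviewer's point about `Repetition` is easier to weigh with the baseline spelled out. The sketch below assumes that "Repetition" means repeating the last observed graph snapshot over the whole forecast horizon; the excerpt does not define it, so this reading, the function names, and the toy data are all assumptions.

```python
import numpy as np


def repetition_forecast(history, horizon):
    """Repeat the last observed snapshot `horizon` times.

    history: array of shape (T, n_nodes). Assumed baseline definition.
    """
    return np.repeat(history[-1:], horizon, axis=0)


def mse(pred, target):
    """Mean squared error over all time steps and nodes."""
    return float(np.mean((pred - target) ** 2))


# Toy trajectory: 3 nodes whose values all increase by 1 per time step.
t = np.arange(12, dtype=float)[:, None]
series = t + np.zeros((1, 3))

history, future = series[:8], series[8:]
pred = repetition_forecast(history, horizon=len(future))
baseline_error = mse(pred, future)  # 7.5 on this toy trajectory
```

Because this baseline has no trainable parameters, a learned model that scores worse than it on the same split is not extracting useful signal from the history, which is why the reviewer flags the `GraphEncoding` result as odd.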
Taxonomy
Topics: Graph Theory and Algorithms · Data Management and Algorithms
