GraphNetz: Statistical Benchmarking of Graph Neural Networks with Paired Tests and Rank Aggregation
Kleyton da Costa, Bernardo Modenesi

TL;DR
GraphNetz is a benchmarking framework for GNNs that makes statistical rigor the default: it reports confidence intervals, paired tests, and rank aggregation so that model comparisons are fair and reproducible.
Contribution
GraphNetz offers a standardized, statistically principled benchmarking pipeline for GNNs that covers multiple datasets, models, and tasks and produces its statistical report automatically.
Findings
No significant difference among four canonical GNN encoders at α=0.05.
The framework supports 63 dataset loaders, four task types, and five GNN architectures.
It provides reproducible, statistically validated benchmarks for graph learning.
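To make the "no significant difference" finding concrete, the following is a minimal sketch of a Friedman-Nemenyi rank comparison, not GraphNetz's own code. The function names and the score table layout are hypothetical. The critical values are the standard α=0.05 Nemenyi constants, and the Friedman statistic omits the tie correction for brevity.

```python
from math import sqrt

# Nemenyi critical values q_alpha at alpha = 0.05, keyed by the number of models k.
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def rank_row(scores):
    """Rank one dataset's scores: 1 = best (highest); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of models tied with the one at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def friedman_nemenyi(table):
    """table[i][j] is the score of model j on dataset i (hypothetical layout).

    Returns the average rank per model, the Friedman chi-square statistic
    (without tie correction), and the Nemenyi critical difference at alpha = 0.05.
    """
    n, k = len(table), len(table[0])
    ranks = [rank_row(row) for row in table]
    avg_ranks = [sum(r[j] for r in ranks) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    )
    cd = Q_ALPHA_05[k] * sqrt(k * (k + 1) / (6 * n))
    return avg_ranks, chi2, cd


if __name__ == "__main__":
    # Made-up accuracies for four models on three datasets, for illustration only.
    demo = [[0.81, 0.80, 0.79, 0.82], [0.70, 0.72, 0.71, 0.69], [0.90, 0.88, 0.91, 0.89]]
    print(friedman_nemenyi(demo))
```

Two models are declared different only when their average ranks differ by more than the critical difference. With few datasets the critical difference is large, which is one way a comparison of four encoders can end with no significant pairs.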
Abstract
Graph neural network (GNN) benchmarks often report single point estimates, even when performance differences are small relative to the variation across random seeds, train/test splits, and datasets. Confidence intervals, paired comparisons, multiple-comparison correction, and rank-based aggregation are standard statistical tools, but they are rarely the default output of graph-learning benchmark suites. We introduce GraphNetz, a benchmarking framework whose default output is a structured statistical report rather than a raw accuracy table. GraphNetz currently includes 63 dataset loaders, four task types, and five canonical GNN architectures, and it also supports custom datasets and models. The framework standardizes multi-seed evaluation and automatically returns per-cell confidence intervals, Holm-corrected paired tests, and Friedman-Nemenyi critical-difference diagrams across tasks. In…
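As an illustration of the per-cell statistics named in the abstract, here is a minimal standard-library sketch, not the GraphNetz API. All function names are hypothetical. The abstract does not say which interval or paired test the framework uses, so the percentile bootstrap and the exact sign-flip permutation test below are assumptions chosen for simplicity. The Holm step-down adjustment is the standard procedure.

```python
import random
from itertools import product
from statistics import mean


def bootstrap_ci(scores, level=0.95, n_boot=2000, seed=0):
    """Percentile-bootstrap interval for the mean of one cell's per-seed scores."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_boot)
    )
    lo = means[int((1 - level) / 2 * n_boot)]
    hi = means[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi


def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired per-seed scores.

    a[i] and b[i] must come from the same seed and split; the test enumerates
    all 2**n sign assignments, so it is only practical for small n.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
    return hits / total


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted
```

A pair of models would be reported as significantly different only if its Holm-adjusted p-value falls below the chosen α. Note that with n seeds the smallest p-value the exact sign-flip test can produce is 2/2^n, so a handful of seeds cannot reach α=0.05 after correction across many pairs.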
