WikiDBGraph: A Data Management Benchmark Suite for Collaborative Learning over Database Silos

Zhaomin Wu; Ziyang Wang; Bingsheng He

arXiv:2505.16635 · cs.DB · March 10, 2026

TL;DR

WikiDBGraph introduces a large-scale, realistic benchmark dataset for evaluating collaborative learning methods across interconnected, unaligned, and complex data silos, revealing gaps in current approaches.

Contribution

The paper presents WikiDBGraph, a comprehensive dataset capturing real-world database interconnections and properties, to evaluate and improve collaborative learning over data silos.

Findings

01 Existing CL methods face challenges with real-world, unaligned databases.

02 WikiDBGraph reveals limitations of current algorithms in practical scenarios.

03 The dataset enables testing of end-to-end data management in collaborative learning.

Abstract

Relational databases are often fragmented across organizations, creating data silos that hinder distributed data management and mining. Collaborative learning (CL) -- techniques that enable multiple parties to train models jointly without sharing raw data -- offers a principled approach to this challenge. However, existing CL frameworks (e.g., federated and split learning) remain limited in real-world deployments. Current CL benchmarks and algorithms primarily target the learning step under assumptions of isolated, aligned, and joinable databases, and they typically neglect the end-to-end data management pipeline, especially preprocessing steps such as table joins and data alignment. In contrast, our analysis of the real-world corpus WikiDBs shows that databases are interconnected, unaligned, and sometimes unjoinable, exposing a significant gap between CL algorithm design and practical…

Code & Models

Datasets

Jerrylife/WikiDBGraph
dataset · 358 dl
