SchemaDB: Structures in Relational Datasets
Cody James Christopher, Kristen Moore, David Liebowitz

TL;DR
SchemaDB is a dataset of 2,500 relational database schemata, provided in both SQL and graph formats, that enables research into real-world database structures, which are typically private and under-studied.
Contribution
The paper introduces SchemaDB, a novel, standardized collection of relational schemata from public repositories, facilitating structural analysis and downstream research.
Findings
Provides detailed summary statistics of the schemata
Offers insights into common structural patterns
Facilitates future research in database structure analysis
Abstract
In this paper we introduce the SchemaDB dataset: a collection of relational database schemata in both SQL and graph formats. Databases are not commonly shared publicly for reasons of privacy and security, so schemata are not available for study. Consequently, an understanding of database structures in the wild is lacking, and most examples found publicly belong to common development frameworks or are derived from textbooks or engine benchmark designs. SchemaDB contains 2,500 samples of relational schemata found in public repositories, which we have standardised to MySQL syntax. We provide our gathering and transformation methodology, summary statistics, and structural analysis, and discuss potential downstream research tasks in several domains.
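To make the idea of a schema in "graph format" concrete, the following is a minimal sketch of one plausible mapping: tables become nodes and foreign-key references become directed edges. It is an illustration only, not the authors' transformation pipeline. The example schema, the regular expressions, and the function name `schema_to_graph` are all assumptions introduced here, and the naive regex parsing would not cope with the full range of MySQL syntax.

```python
import re

# Hypothetical MySQL-style schema, invented for illustration (not taken from SchemaDB).
SCHEMA_SQL = """
CREATE TABLE customer (
  id INT PRIMARY KEY,
  name VARCHAR(100)
);
CREATE TABLE orders (
  id INT PRIMARY KEY,
  customer_id INT,
  FOREIGN KEY (customer_id) REFERENCES customer(id)
);
CREATE TABLE order_item (
  id INT PRIMARY KEY,
  order_id INT,
  FOREIGN KEY (order_id) REFERENCES orders(id)
);
"""

# Naive patterns: adequate for this toy input, not a real SQL parser.
TABLE_RE = re.compile(
    r"CREATE\s+TABLE\s+`?(\w+)`?\s*\((.*?)\)\s*;",
    re.IGNORECASE | re.DOTALL,
)
FK_RE = re.compile(
    r"FOREIGN\s+KEY\s*\(`?(\w+)`?\)\s*REFERENCES\s+`?(\w+)`?",
    re.IGNORECASE,
)


def schema_to_graph(sql):
    """Return (nodes, edges): table names, and (source, column, target) foreign keys."""
    nodes = []
    edges = []
    for table, body in TABLE_RE.findall(sql):
        nodes.append(table)
        for column, target in FK_RE.findall(body):
            edges.append((table, column, target))
    return nodes, edges


nodes, edges = schema_to_graph(SCHEMA_SQL)

# A simple structural statistic: how many foreign keys point at each table.
in_degree = {name: 0 for name in nodes}
for _source, _column, target in edges:
    if target in in_degree:
        in_degree[target] += 1

print(nodes)
print(edges)
print(in_degree)
```

Once a schema is in this form, summary statistics such as table counts, foreign-key counts, and per-table in-degree reduce to ordinary graph computations.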
Taxonomy
Topics: Data Quality and Management · Advanced Graph Neural Networks · Privacy-Preserving Technologies in Data
