SchemaDB: Structures in Relational Datasets

Cody James Christopher; Kristen Moore; David Liebowitz

arXiv:2111.12835·cs.DB·May 30, 2025

TL;DR

SchemaDB is a dataset of 2,500 relational database schemata, provided in both SQL and graph formats. It enables research into real-world database structures, which are typically private and under-studied.

Contribution

The paper introduces SchemaDB, a novel, standardized collection of relational schemata from public repositories, facilitating structural analysis and downstream research.

Findings

01. Provides detailed summary statistics of the schemata

02. Offers insights into common structural patterns

03. Facilitates future research in database structure analysis

Abstract

In this paper we introduce the SchemaDB dataset, a collection of relational database schemata in both SQL and graph formats. Databases are not commonly shared publicly for reasons of privacy and security, so schemata are not available for study. Consequently, an understanding of database structures in the wild is lacking, and most examples found publicly belong to common development frameworks or are derived from textbooks or engine benchmark designs. SchemaDB contains 2,500 samples of relational schemata found in public repositories, which we have standardised to MySQL syntax. We provide our gathering and transformation methodology, summary statistics, and structural analysis, and discuss potential downstream research tasks in several domains.
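To make the "SQL and graph formats" pairing concrete, here is a minimal sketch of how a schema in MySQL syntax might be viewed as a graph, with tables as nodes and foreign keys as directed edges. The abstract does not specify SchemaDB's actual graph encoding, so this representation is an assumption. The example schema, the regular expressions, and the `schema_to_graph` helper are all hypothetical and are not taken from the dataset or the paper.

```python
import re

# Hypothetical two-table schema in MySQL syntax (not taken from SchemaDB).
SCHEMA_SQL = """
CREATE TABLE customer (
    id INT PRIMARY KEY,
    name VARCHAR(100)
);
CREATE TABLE orders (
    id INT PRIMARY KEY,
    customer_id INT,
    FOREIGN KEY (customer_id) REFERENCES customer(id)
);
"""

# Simplified patterns: they handle only plain CREATE TABLE statements and
# single-column FOREIGN KEY clauses, not the full MySQL grammar.
TABLE_RE = re.compile(
    r"CREATE\s+TABLE\s+`?(\w+)`?\s*\((.*?)\)\s*;",
    re.IGNORECASE | re.DOTALL,
)
FK_RE = re.compile(
    r"FOREIGN\s+KEY\s*\(\s*`?(\w+)`?\s*\)\s*"
    r"REFERENCES\s+`?(\w+)`?\s*\(\s*`?(\w+)`?\s*\)",
    re.IGNORECASE,
)


def schema_to_graph(sql):
    """Return (nodes, edges): tables as nodes, foreign keys as directed edges.

    Assumed encoding: each edge is (referencing_table, referenced_table, label),
    where label records the column pair that forms the foreign key.
    """
    nodes = []
    edges = []
    for table, body in TABLE_RE.findall(sql):
        nodes.append(table)
        for column, ref_table, ref_column in FK_RE.findall(body):
            edges.append((table, ref_table, f"{column}->{ref_column}"))
    return nodes, edges


nodes, edges = schema_to_graph(SCHEMA_SQL)
print(nodes)  # ['customer', 'orders']
print(edges)  # [('orders', 'customer', 'customer_id->id')]
```

Simple summary statistics of the kind the abstract mentions, such as table count or foreign keys per schema, would then follow from `len(nodes)` and `len(edges)`. A production pipeline would more plausibly use a real SQL parser than regular expressions.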

Taxonomy

Topics: Data Quality and Management · Advanced Graph Neural Networks · Privacy-Preserving Technologies in Data