JSONoid: Monoid-based Enrichment for Configurable and Scalable Data-Driven Schema Discovery

Michael J. Mior

arXiv:2307.03113·cs.DB·July 7, 2023·1 cites

TL;DR

JSONoid is a distributed schema discovery method for JSON data. It uses monoid data structures to generate schemas efficiently and to enrich them with metadata about the data values, and it scales to large datasets.

Contribution

The paper introduces a monoid-based approach to scalable, distributed schema discovery that also enriches the discovered schema with metadata about data values.
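To make the monoid idea concrete, here is a minimal sketch in Python. It is not JSONoid's implementation: the class `NumericSummary`, its fields, and the helper `summarize` are hypothetical names chosen for illustration. The sketch shows the property that makes a monoid suitable for distributed processing: an associative `combine` with an identity element, so per-partition summaries can be merged in any grouping and still match a single sequential pass.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Optional


@dataclass(frozen=True)
class NumericSummary:
    """Hypothetical per-field summary (illustrative, not JSONoid's class)."""
    count: int = 0
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    @staticmethod
    def of(value: float) -> "NumericSummary":
        # Summary of a single observed value.
        return NumericSummary(1, value, value)

    def combine(self, other: "NumericSummary") -> "NumericSummary":
        # Associative merge; the empty summary acts as the identity.
        mins = [m for m in (self.minimum, other.minimum) if m is not None]
        maxs = [m for m in (self.maximum, other.maximum) if m is not None]
        return NumericSummary(
            self.count + other.count,
            min(mins) if mins else None,
            max(maxs) if maxs else None,
        )


EMPTY = NumericSummary()  # identity element


def summarize(values: Iterable[float]) -> NumericSummary:
    # Fold one partition's values into a single summary.
    return reduce(NumericSummary.combine, map(NumericSummary.of, values), EMPTY)


# Simulate three partitions, one of them empty.
partitions = [[3, 9, 4], [], [7, 1]]
partials = [summarize(p) for p in partitions]
merged = reduce(NumericSummary.combine, partials, EMPTY)
sequential = summarize([v for p in partitions for v in p])
```

Because `merged` equals `sequential`, each worker only needs to ship its small partial summary rather than its raw data. Other summaries with an associative merge could be plugged in the same way under this assumption.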

Findings

01 Performs comparably to existing distributed schema discovery methods

02 Enriches schemas with additional data value metadata

03 Scales linearly with dataset size

Abstract

Schema discovery is an important aspect of working with data in formats such as JSON. Unlike relational databases, JSON datasets often do not have associated structural information. Consumers of such datasets are often left to browse through the data, looking for commonalities in structure across documents, in order to write suitable data processing code. This process is time-consuming and error-prone. Existing distributed approaches to mining schemas present a significant usability advantage, as they provide useful metadata for large data sources. However, depending on the data source, ad hoc queries for estimating other properties to help with crafting an efficient data pipeline can be expensive. We propose JSONoid, a distributed schema discovery process augmented with additional metadata in the form of monoid data structures that are easily maintainable in a…
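The manual step the abstract describes, observing structural commonalities across documents, can be sketched as a merge of per-document schemas. The following Python is an illustrative assumption, not the paper's algorithm: `type_name`, `schema_of`, and `merge` are hypothetical helpers, and the sketch only inspects top-level keys.

```python
from functools import reduce


def type_name(value) -> str:
    # Map a parsed JSON value to a JSON type name.
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def schema_of(doc: dict) -> dict:
    # Schema of one document: observed types and an occurrence count per key.
    return {k: {"types": {type_name(v)}, "count": 1} for k, v in doc.items()}


def merge(a: dict, b: dict) -> dict:
    # Union the observed types and add the counts, without mutating inputs.
    out = {k: {"types": set(v["types"]), "count": v["count"]} for k, v in a.items()}
    for k, v in b.items():
        if k in out:
            out[k]["types"] |= v["types"]
            out[k]["count"] += v["count"]
        else:
            out[k] = {"types": set(v["types"]), "count": v["count"]}
    return out


docs = [
    {"id": 1, "name": "a"},
    {"id": "2", "tags": []},
    {"id": 3, "name": None},
]
schema = reduce(merge, map(schema_of, docs), {})

# Keys seen in fewer documents than the total are candidates for optional fields.
optional = sorted(k for k, v in schema.items() if v["count"] < len(docs))
```

Here the empty dict is the identity and `merge` is associative, so the same fold could be run per partition and the partial schemas merged afterwards.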

Code & Models

Repositories

dataunitylab/jsonoid-discovery (Official)

Taxonomy

Topics

Data Mining Algorithms and Applications · Advanced Database Systems and Queries · Semantic Web and Ontologies