JSONoid: Monoid-based Enrichment for Configurable and Scalable Data-Driven Schema Discovery
Michael J. Mior

TL;DR
JSONoid is a distributed schema discovery method for JSON data. It uses monoid data structures to generate schemas efficiently, enrich them with metadata about data values, and scale to large datasets.
Contribution
It introduces a monoid-based approach for scalable, distributed schema discovery that also provides enriched metadata about data values.
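To make the monoid idea concrete, here is a minimal Python sketch. It is not JSONoid's actual implementation or API: `FieldSummary`, `summarize_value`, and `summarize_partition` are hypothetical names chosen for illustration, and the sample documents are made up. The sketch only shows why a monoid (an associative `combine` plus an identity element) suits distributed processing: each partition can be summarized independently and the partial summaries merged in any grouping.

```python
import json
from collections import Counter
from dataclasses import dataclass, field
from functools import reduce


@dataclass(frozen=True)
class FieldSummary:
    """Hypothetical monoid summarizing the values observed for one field."""
    count: int = 0
    types: Counter = field(default_factory=Counter)
    minimum: float = float("inf")    # identity for min
    maximum: float = float("-inf")   # identity for max

    def combine(self, other: "FieldSummary") -> "FieldSummary":
        # Associative merge; FieldSummary() acts as the identity element.
        return FieldSummary(
            count=self.count + other.count,
            types=self.types + other.types,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )


def summarize_value(value) -> FieldSummary:
    """Lift a single JSON value into the monoid."""
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    return FieldSummary(
        count=1,
        types=Counter([type(value).__name__]),
        minimum=value if is_number else float("inf"),
        maximum=value if is_number else float("-inf"),
    )


def summarize_partition(lines, key: str) -> FieldSummary:
    """Fold one partition of JSON lines into a single summary for `key`."""
    docs = (json.loads(line) for line in lines)
    summaries = (summarize_value(doc[key]) for doc in docs if key in doc)
    return reduce(FieldSummary.combine, summaries, FieldSummary())


# Two partitions, as if held by two different workers (made-up sample data).
partition_a = ['{"age": 31}', '{"age": 47}']
partition_b = ['{"age": "unknown"}', '{"name": "x"}', '{"age": 19}']

merged = summarize_partition(partition_a, "age").combine(
    summarize_partition(partition_b, "age")
)
```

Because `combine` is associative, merging the two per-partition summaries gives the same result as summarizing all documents in one pass, which is the property a distributed engine relies on when it reduces partial results.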
Findings
Performs comparably to existing distributed schema discovery methods
Enriches schemas with additional data value metadata
Scales linearly with dataset size
Abstract
Schema discovery is an important aspect of working with data in formats such as JSON. Unlike relational databases, JSON datasets often do not have associated structural information. Consumers of such datasets are often left to browse through data in an attempt to observe commonalities in structure across documents to construct suitable code for data processing. However, this process is time-consuming and error-prone. Existing distributed approaches to mining schemas present a significant usability advantage as they provide useful metadata for large data sources. However, depending on the data source, ad hoc queries for estimating other properties to help with crafting an efficient data pipeline can be expensive. We propose JSONoid, a distributed schema discovery process augmented with additional metadata in the form of monoid data structures that are easily maintainable in a…
Taxonomy
Topics: Data Mining Algorithms and Applications · Advanced Database Systems and Queries · Semantic Web and Ontologies
