Introducing Schema Inference as a Scalable SQL Function [Extended Version]
Calvin Dani, Shiva Jahangiri, Thomas Hütter

TL;DR
This paper presents a scalable, integrated SQL function for schema inference within a DBMS, significantly improving performance over external methods and enhancing schema management in NoSQL databases.
Contribution
It introduces a novel in-DBMS schema inference function that performs local and global schema discovery, eliminating reliance on external frameworks.
Findings
Up to 100x performance improvement over external methods
Successful implementation in Apache AsterixDB
Effective schema inference on real-world datasets
Abstract
This paper introduces a novel approach to schema inference as an on-demand function integrated directly within a DBMS, targeting NoSQL databases, where schema flexibility can create challenges. Unlike previous methods that rely on external frameworks such as Apache Spark, our solution exposes schema inference as a SQL function, allowing users to infer schemas natively within the DBMS. Implemented in Apache AsterixDB, it performs schema discovery in two phases, local inference and global schema merging, leveraging internal resources for improved performance. Experiments with real-world datasets show a performance improvement of up to two orders of magnitude over external methods, enhancing usability and scalability.
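The two phases named in the abstract can be illustrated with a minimal Python sketch. This is not the AsterixDB implementation or the paper's algorithm; the schema representation and the function names (infer_value, merge, local_inference, global_merge) are hypothetical, chosen only to show how per-partition schemas might be inferred independently and then merged into one. For brevity the sketch does not track whether a field is optional.

```python
def infer_value(value):
    """Describe one JSON-like value as a schema dict (hypothetical format)."""
    schema = {"types": set(), "fields": {}, "items": None}
    if isinstance(value, dict):
        schema["types"].add("object")
        schema["fields"] = {k: infer_value(v) for k, v in value.items()}
    elif isinstance(value, list):
        schema["types"].add("array")
        for element in value:
            schema["items"] = merge(schema["items"], infer_value(element))
    elif value is None:
        schema["types"].add("null")
    elif isinstance(value, bool):  # checked before int: bool is a subclass of int
        schema["types"].add("boolean")
    elif isinstance(value, (int, float)):
        schema["types"].add("number")
    elif isinstance(value, str):
        schema["types"].add("string")
    else:
        raise TypeError(f"unsupported value: {value!r}")
    return schema


def merge(a, b):
    """Union two schemas; None stands for 'no schema seen yet'."""
    if a is None:
        return b
    if b is None:
        return a
    fields = dict(a["fields"])
    for name, sub in b["fields"].items():
        fields[name] = merge(fields.get(name), sub)
    return {
        "types": a["types"] | b["types"],
        "fields": fields,
        "items": merge(a["items"], b["items"]),
    }


def local_inference(partition):
    """Phase 1: fold all records of one partition into a local schema."""
    schema = None
    for record in partition:
        schema = merge(schema, infer_value(record))
    return schema


def global_merge(local_schemas):
    """Phase 2: combine the local schemas into a single global schema."""
    schema = None
    for local in local_schemas:
        schema = merge(schema, local)
    return schema


# Two partitions whose records disagree on the type of "id".
partitions = [
    [{"id": 1, "name": "a"}],
    [{"id": "x", "tags": ["t"]}],
]
global_schema = global_merge(local_inference(p) for p in partitions)
```

Because merge is a plain union, the result does not depend on the order in which partitions are processed, which is what allows the local phase to run independently on each partition in this sketch.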
Taxonomy
Topics: Scientific Computing and Data Management · Advanced Database Systems and Queries
