Schema Extraction on Semi-structured Data

Panpan Li; Yikun Gong; Chen Wang

arXiv:2012.08105 · cs.DB · October 19, 2021 · 1 citation

Open Access

TL;DR

This paper surveys various schema extraction techniques for semi-structured data in NoSQL databases, comparing structural and statistical methods, tools, and systems to aid data management and understanding.

Contribution

It provides a comprehensive overview of existing schema extraction methods, tools, and systems, highlighting their applicability, interpretability, and generalization capabilities.

Findings

01. Structural methods yield more interpretable schemas.

02. Statistical methods offer better applicability and generalization.

03. Tools are suitable for small datasets; systems handle large, complex data.

Abstract

With the continuous development of NoSQL databases, more and more developers are choosing semi-structured data for development and data management, which creates a need for schema management of the semi-structured data stored in NoSQL databases. Schema extraction plays an important role in understanding schemas, optimizing queries, and validating data consistency. In this survey, we therefore investigate two families of schema extraction methods: structural methods based on trees and graphs, and statistical methods based on distributed architectures and machine learning. The schemas obtained by structural methods are more interpretable, while statistical methods have better applicability and generalization ability. We also investigate tools and systems for schema extraction. Schema extraction tools are mainly used for Spark or NoSQL databases, and are suitable for small datasets or simple…
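
To make the idea of extracting a schema from semi-structured data concrete, here is a minimal sketch in Python. It is not taken from the survey or from any tool it covers. It assumes the input is a list of already-parsed JSON documents and treats each document as a tree, recording which types appear at each path and whether the path occurs in every document. The names `type_name`, `collect_paths`, and `extract_schema`, the `$`-rooted path notation, and the coarse type labels are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def type_name(value):
    """Map a parsed JSON value to a coarse type label (illustrative labels)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def collect_paths(value, prefix, seen):
    """Walk one document tree and record the types observed at each path."""
    seen[prefix].add(type_name(value))
    if isinstance(value, dict):
        for key, child in value.items():
            collect_paths(child, f"{prefix}.{key}", seen)
    elif isinstance(value, list):
        for item in value:
            collect_paths(item, f"{prefix}[]", seen)


def extract_schema(documents):
    """Merge per-document path/type observations into one summary schema.

    A path is marked "required" only if it occurs in every document
    (measured against the whole collection, not against its parent).
    """
    types = defaultdict(set)
    counts = defaultdict(int)
    for doc in documents:
        seen = defaultdict(set)
        collect_paths(doc, "$", seen)
        for path, names in seen.items():
            types[path] |= names
            counts[path] += 1
    total = len(documents)
    return {
        path: {"types": sorted(types[path]), "required": counts[path] == total}
        for path in sorted(types)
    }
```

Calling `extract_schema` on two documents that disagree on the type of a field yields a union of types for that path, and a field present in only one document is reported as not required. A real system would also need to handle very large collections and schema variants, which this sketch does not attempt.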

Taxonomy

Topics: Advanced Database Systems and Queries · Data Quality and Management · Semantic Web and Ontologies