An LSM-based Tuple Compaction Framework for Apache AsterixDB (Extended Version)
Wail Y. Alkowaileet, Sattam Alsubaiee, Michael J. Carey

TL;DR
This paper introduces an LSM-based tuple compaction framework for Apache AsterixDB that infers schemas from semi-structured records during ingestion, reducing storage overhead and improving query efficiency.
Contribution
It presents a novel schema inference and extraction framework leveraging LSM lifecycle events, enhancing storage efficiency in document database systems.
Findings
Reduces storage overhead significantly.
Improves data ingestion performance.
Enhances query performance in AsterixDB.
Abstract
Document database systems store self-describing semi-structured records, such as JSON, "as-is" without requiring users to pre-define a schema. This gives users the flexibility to change the structure of incoming records without worrying about taking the system offline or hindering the performance of currently running queries. However, the flexibility of such systems is not free. The large amount of redundancy in the records can introduce unnecessary storage overhead and impact query performance. Our focus in this paper is to address the storage overhead issue by introducing a tuple compactor framework that infers and extracts the schema from self-describing semi-structured records during data ingestion. As many prominent document stores, such as MongoDB and Couchbase, adopt Log-Structured Merge (LSM) trees in their storage engines, our framework exploits LSM…
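To make the idea of inferring and extracting a schema during ingestion concrete, here is a minimal Python sketch. It is not AsterixDB's implementation: the function names (`infer_schema`, `compact`, `flush`), the choice of a flush as the triggering LSM lifecycle event, and the integer field-id encoding are all assumptions made for illustration. It only shows how field names repeated in every self-describing record can be factored out into one shared schema.

```python
def infer_schema(records):
    """Union the field names and value type names seen across a batch."""
    schema = {}
    for rec in records:
        for name, value in rec.items():
            schema.setdefault(name, set()).add(type(value).__name__)
    return schema


def compact(record, field_ids):
    """Replace each repeated field-name string with a small integer id."""
    return [(field_ids[name], value) for name, value in record.items()]


def flush(memtable):
    """Hypothetical flush hook: infer the schema once for the whole batch,
    then emit records that no longer carry their own field names."""
    schema = infer_schema(memtable)
    field_ids = {name: i for i, name in enumerate(sorted(schema))}
    compacted = [compact(rec, field_ids) for rec in memtable]
    return schema, field_ids, compacted


# Two records with different shapes, as a schemaless store would accept them.
batch = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob", "age": 30},
]
schema, field_ids, compacted = flush(batch)
```

In this sketch the schema is stored once per flushed batch rather than once per record, which is the source of the storage saving; records with new fields are still accepted because the schema is recomputed from whatever arrives.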
Taxonomy
Topics: Advanced Database Systems and Queries · Cloud Computing and Resource Management · Data Quality and Management
