Robust and Scalable Content-and-Structure Indexing (Extended Version)

Kevin Wellenzohn; Michael H. Böhlen; Sven Helmer; Antoine Pietri; Stefano Zacchiroli

arXiv:2209.05126 · cs.DB · September 13, 2022

Open Access

TL;DR

The paper introduces RSCAS, a robust and scalable index for efficient Content-and-Structure queries on large semi-structured data, combining dynamic interleaving and trie-based storage within an LSM tree.

Contribution

The paper presents a novel dynamic interleaving technique and a trie-based RSCAS index, implemented as an LSM tree, for scalable and robust CAS query processing.

Findings

01. Successfully indexes the Software Heritage archive.

02. Supports a wide range of CAS queries, including queries with wildcards and descendant axes.

03. Demonstrates robustness and scalability in large data environments.

Abstract

Frequent queries on semi-structured hierarchical data are Content-and-Structure (CAS) queries that filter data items based on their location in the hierarchical structure and their value for some attribute. We propose the Robust and Scalable Content-and-Structure (RSCAS) index to efficiently answer CAS queries on big semi-structured data. To get an index that is robust against queries with varying selectivities, we introduce a novel dynamic interleaving that merges the path and value dimensions of composite keys in a balanced manner. We store interleaved keys in our trie-based RSCAS index, which efficiently supports a wide range of CAS queries, including queries with wildcards and descendant axes. We implement RSCAS as a log-structured merge (LSM) tree to scale it to data-intensive applications with a high insertion rate. We illustrate RSCAS's robustness and scalability by indexing data…
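To make the notion of a CAS query concrete, the following is a minimal sketch of its semantics only: a brute-force filter over (path, value) composite keys. It is not the RSCAS index and uses no interleaving, trie, or LSM tree. The names `Item`, `path_pattern_to_regex`, and `cas_query`, the sample data, and the pattern syntax (`**` for a descendant axis spanning any number of path segments, `*` for a wildcard within one segment) are all assumptions made for illustration.

```python
import re
from typing import Iterable, List, NamedTuple


class Item(NamedTuple):
    """A composite key: a path in the hierarchy plus an attribute value."""
    path: str
    value: int


def path_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a hypothetical path pattern into a regular expression.

    '**' matches across path segments (descendant axis);
    '*' matches any characters within a single segment.
    """
    pieces = []
    # Split on the wildcards while keeping them; '**' is tried before '*'.
    for token in re.split(r"(\*\*|\*)", pattern):
        if token == "**":
            pieces.append(".*")
        elif token == "*":
            pieces.append("[^/]*")
        else:
            pieces.append(re.escape(token))
    return re.compile("".join(pieces))


def cas_query(items: Iterable[Item], pattern: str, lo: int, hi: int) -> List[Item]:
    """Return items whose path matches `pattern` and whose value is in [lo, hi].

    This scans every item; an index exists precisely to avoid such a scan.
    """
    rx = path_pattern_to_regex(pattern)
    return [it for it in items if lo <= it.value <= hi and rx.fullmatch(it.path)]


# Hypothetical sample data: file paths with an integer attribute.
items = [
    Item("/src/fs/ext4/inode.c", 120),
    Item("/src/fs/ext4/inode.h", 130),
    Item("/doc/readme.md", 125),
    Item("/src/net/tcp.c", 500),
]

result = cas_query(items, "/src/**/*.c", 100, 200)
```

Here the path predicate alone would keep two `.c` files and the value predicate alone would keep three items; only their conjunction narrows the result to one. Which predicate is more selective differs from query to query, which is the kind of variation the abstract's robustness claim refers to.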

Taxonomy

Topics: Data Management and Algorithms · Advanced Database Systems and Queries · Semantic Web and Ontologies