Robust and Scalable Content-and-Structure Indexing (Extended Version)
Kevin Wellenzohn, Michael H. Böhlen, Sven Helmer, Antoine Pietri, Stefano Zacchiroli

TL;DR
The paper introduces RSCAS, a robust and scalable index for efficient Content-and-Structure queries on large semi-structured data, combining dynamic interleaving and trie-based storage within an LSM tree.
Contribution
It presents a novel dynamic interleaving technique and a trie-based RSCAS index implemented as an LSM tree for scalable, robust CAS query processing.
Findings
Successfully indexes the Software Heritage archive.
Supports a wide range of CAS queries including wildcards.
Demonstrates robustness and scalability in large data environments.
Abstract
Frequent queries on semi-structured hierarchical data are Content-and-Structure (CAS) queries that filter data items based on their location in the hierarchical structure and their value for some attribute. We propose the Robust and Scalable Content-and-Structure (RSCAS) index to efficiently answer CAS queries on big semi-structured data. To get an index that is robust against queries with varying selectivities we introduce a novel dynamic interleaving that merges the path and value dimensions of composite keys in a balanced manner. We store interleaved keys in our trie-based RSCAS index, which efficiently supports a wide range of CAS queries, including queries with wildcards and descendant axes. We implement RSCAS as a log-structured merge (LSM) tree to scale it to data-intensive applications with a high insertion rate. We illustrate RSCAS's robustness and scalability by indexing data…
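To make the notion of a CAS query concrete, the following is a minimal sketch of its semantics only, not of the RSCAS index. It scans a list of (path, value) keys and keeps those whose path matches a pattern and whose value lies in a range. The pattern syntax (`*` for a wildcard within one path step, `**` for a descendant axis), the function names, and the sample data are assumptions made for illustration and are not taken from the paper.

```python
import re
from typing import Iterable, List, Tuple

Key = Tuple[str, int]  # hypothetical composite key: (path, value)


def path_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate an assumed pattern syntax into a regular expression.

    '**' matches any number of characters, including '/' (descendant axis).
    '*'  matches any characters except '/' (wildcard within one path step).
    """
    pieces = []
    for part in re.split(r"(\*\*|\*)", pattern):
        if part == "**":
            pieces.append(".*")
        elif part == "*":
            pieces.append("[^/]*")
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces))


def cas_query(keys: Iterable[Key], pattern: str, lo: int, hi: int) -> List[Key]:
    """Brute-force CAS query: path predicate AND value range predicate."""
    rx = path_pattern_to_regex(pattern)
    return [(p, v) for p, v in keys if lo <= v <= hi and rx.fullmatch(p)]


# Illustrative data only; paths and values are made up.
keys = [
    ("/src/lib/util.c", 100),
    ("/src/lib/io/file.c", 250),
    ("/doc/readme.md", 120),
    ("/src/lib/util.h", 150),
]

narrow = cas_query(keys, "/src/**/*.c", 50, 200)
wide = cas_query(keys, "/src/**/*.c", 0, 300)
```

A linear scan like this evaluates both predicates on every key. An index over composite keys aims to avoid that cost regardless of whether the path predicate or the value predicate is the more selective one, which is the robustness concern the abstract describes.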
Taxonomy
Topics: Data Management and Algorithms · Advanced Database Systems and Queries · Semantic Web and Ontologies
