S2RDF: RDF Querying with SPARQL on Spark
Alexander Schätzle, Martin Przyjaciel-Zablocki, Simon Skilevic, Georg Lausen

TL;DR
S2RDF is a scalable system that enables fast SPARQL querying over large RDF datasets on Spark by using a novel ExtVP partitioning schema, significantly improving performance over existing Hadoop-based solutions.
Contribution
The paper introduces ExtVP, a semi-join based RDF partitioning schema, and implements S2RDF on Spark, achieving high-performance SPARQL querying on billion-triple datasets.
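To make the semi-join idea concrete, here is a minimal sketch in plain Python rather than Spark. It assumes the triples are first split into one two-column (subject, object) table per predicate, and that a table is then reduced to the rows that can actually join with another predicate's table. The helper names `vertical_partition` and `semi_join`, and the toy data, are illustrative assumptions and are not taken from the S2RDF code base.

```python
from collections import defaultdict

S, O = 0, 1  # column positions in a (subject, object) row


def vertical_partition(triples):
    """Group (s, p, o) triples into one (s, o) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def semi_join(left, right, left_col, right_col):
    """Keep the rows of `left` whose `left_col` value occurs in `right_col` of `right`."""
    keys = {row[right_col] for row in right}
    return [row for row in left if row[left_col] in keys]


# Hypothetical toy graph, for illustration only.
triples = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("carol", "follows", "dave"),
    ("bob", "likes", "item1"),
    ("dave", "likes", "item2"),
]

vp = vertical_partition(triples)

# Precompute the "follows" rows whose object also appears as a subject of "likes".
# A query joining ?x follows ?y . ?y likes ?z only needs to read these rows.
follows_os_likes = semi_join(vp["follows"], vp["likes"], O, S)

print(len(vp["follows"]), "->", len(follows_os_likes))
```

In this toy case the reduced table holds two of the three `follows` rows, so a join on that pattern reads less input. The trade-off under this sketch is extra preprocessing and storage for the precomputed tables in exchange for smaller join inputs at query time.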
Findings
S2RDF achieves sub-second query runtimes on billion-triple RDF graphs.
ExtVP reduces query input size regardless of pattern shape.
S2RDF outperforms state-of-the-art Hadoop-based SPARQL systems.
Abstract
RDF has become very popular for semantic data publishing due to its flexible and universal graph-like data model. Yet, the ever-increasing size of RDF data collections makes it more and more infeasible to store and process them on a single machine, raising the need for distributed approaches. Instead of building a standalone but closed distributed RDF store, we endorse the usage of existing infrastructures for Big Data processing, e.g. Hadoop. However, SPARQL query performance is a major challenge as these platforms are not designed for RDF processing from the ground up. Thus, existing Hadoop-based approaches often favor certain query pattern shapes while performance drops significantly for other shapes. In this paper, we describe a novel relational partitioning schema for RDF data called ExtVP that uses a semi-join based preprocessing, akin to the concept of Join Indices in relational…
Taxonomy
Topics: Semantic Web and Ontologies · Advanced Database Systems and Queries · Data Quality and Management
