GraphAr: An Efficient Storage Scheme for Graph Data in Data Lakes
Xue Li, Weibin Zeng, Zhibin Wang, Diwen Zhu, Jingbo Xu, Wenyuan Yu, Jingren Zhou

TL;DR
GraphAr is a storage scheme that extends data lakes to manage graph data efficiently. It supports graph-specific operations such as neighbor retrieval and label filtering, with large speedups over conventional Parquet and Acero-based methods.
Contribution
This paper introduces GraphAr, a specialized storage scheme that captures Labeled Property Graph (LPG) semantics and optimizes graph data operations within data lakes, an approach the paper presents as new in this setting.
Findings
4452x faster neighbor retrieval
14.8x faster label filtering
29.5x improvement in end-to-end workloads
Abstract
Data lakes, increasingly adopted for their ability to store and analyze diverse types of data, commonly use columnar storage formats like Parquet and ORC for handling relational tables. However, these traditional setups fall short when it comes to efficiently managing graph data, particularly those conforming to the Labeled Property Graph (LPG) model. To address this gap, this paper introduces GraphAr, a specialized storage scheme designed to enhance existing data lakes for efficient graph data management. Leveraging the strengths of Parquet, GraphAr captures LPG semantics precisely and facilitates graph-specific operations such as neighbor retrieval and label filtering. Through innovative data organization, encoding, and decoding techniques, GraphAr dramatically improves performance. Our evaluations reveal that GraphAr outperforms conventional Parquet and Acero-based methods, achieving…
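To make the neighbor retrieval operation concrete, here is a minimal sketch in plain Python. It assumes an edge table sorted by source vertex plus a per-vertex offset index, so that a vertex's neighbors form one contiguous slice instead of requiring a full scan. The toy data and the names `build_offsets`, `neighbors`, and `neighbors_by_scan` are hypothetical illustrations, not GraphAr's API or its actual on-disk layout.

```python
# Hypothetical toy edge table of (src, dst) rows, sorted by source vertex ID.
edges = [(0, 1), (0, 2), (1, 2), (3, 0), (3, 1), (3, 2)]
num_vertices = 4


def build_offsets(sorted_edges, n):
    """Return offsets where offsets[v] is the row of the first edge with source v."""
    offsets = [0] * (n + 1)
    # Count outgoing edges per source vertex.
    for src, _ in sorted_edges:
        offsets[src + 1] += 1
    # Prefix-sum the counts into starting row positions.
    for v in range(n):
        offsets[v + 1] += offsets[v]
    return offsets


def neighbors(sorted_edges, offsets, v):
    """Read only the contiguous rows that belong to vertex v."""
    return [dst for _, dst in sorted_edges[offsets[v]:offsets[v + 1]]]


def neighbors_by_scan(edge_list, v):
    """Baseline: filter the whole edge table, as a plain table scan would."""
    return [dst for src, dst in edge_list if src == v]


offsets = build_offsets(edges, num_vertices)
print(neighbors(edges, offsets, 3))
```

Both functions return the same neighbors; the difference is that the offset-based lookup touches only the rows for the requested vertex, while the scan reads every edge.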
Taxonomy
Topics: Data Quality and Management · Graph Theory and Algorithms · Advanced Database Systems and Queries
