Scalable Relational Query Processing on Big Matrix Data
Yongyang Yu, Mingjie Tang, Walid G. Aref

TL;DR
This paper introduces scalable relational query processing methods for large matrix data in distributed environments, significantly improving performance over existing systems by optimizing query plans and partitioning strategies.
Contribution
It develops novel algebraic transformations, a query optimizer, and partitioning schemes for efficient relational operations directly on big matrix data in distributed clusters.
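As a rough illustration of the kind of algebraic transformation such an optimizer could apply, the sketch below pushes a row selection beneath a matrix multiplication: selecting rows of A×B gives the same result as multiplying only the selected rows of A by B. This is a standard linear-algebra identity chosen for illustration, not necessarily one of the paper's own rewrite rules. The helper names (`matmul`, `select_rows`) and the toy matrices are assumptions, and the code is plain single-machine Python rather than the distributed Spark setting the paper targets.

```python
# Illustrative sketch only: helper names and data are hypothetical,
# not taken from the paper or its Spark prototype.

def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    cols_b = list(zip(*b))  # columns of b, so each entry is a row-column dot product
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def select_rows(m, row_ids):
    """Relational-style selection on the row dimension of a matrix."""
    return [m[i] for i in row_ids]

A = [[1, 2], [3, 4], [5, 6]]      # 3 x 2
B = [[7, 8, 9], [10, 11, 12]]     # 2 x 3
wanted = [0, 2]                   # rows of the product that the query keeps

# Naive plan: materialize the full product, then select rows.
naive = select_rows(matmul(A, B), wanted)

# Rewritten plan: select rows of A first, then multiply the smaller matrix.
pushed = matmul(select_rows(A, wanted), B)

print(naive == pushed)
```

Both plans return the same rows, but the rewritten plan never computes the rows of the product that the selection discards, which is the sort of saving a query optimizer over matrix data can exploit.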
Findings
Achieves up to 100x performance improvement over state-of-the-art systems.
Demonstrates effectiveness on real and synthetic datasets.
Prototypes in Apache Spark validate the approach.
Abstract
The use of large-scale machine learning methods is becoming ubiquitous in many applications ranging from business intelligence to self-driving cars. These methods require a complex computation pipeline consisting of various types of operations, e.g., relational operations for pre-processing or post-processing the dataset, and matrix operations for core model computations. Many existing systems focus on efficiently processing matrix-only operations, and assume that the inputs to the relational operators are already pre-computed and are materialized as intermediate matrices. However, the input to a relational operator may be complex in machine learning pipelines, and may involve various combinations of matrix operators. Hence, it is critical to realize scalable and efficient relational query processors that directly operate on big matrix data. This paper presents new efficient and…
Taxonomy
Topics: Graph Theory and Algorithms · Advanced Graph Neural Networks · Data Quality and Management
