Similarity Group-by Operators for Multi-dimensional Relational Data
Mingjie Tang, Ruby Y. Tahboub, Walid G. Aref, Mikhail J. Atallah, Qutaibah M. Malluhi, Mourad Ouzzani, and Yasin N. Silva

TL;DR
This paper introduces two new similarity-based group-by operators for multidimensional data in SQL, enabling more meaningful grouping by considering attribute correlations, with minimal overhead and significant performance improvements.
Contribution
The paper presents novel multidimensional similarity group-by operators, addressing limitations of existing methods by considering attribute correlations and demonstrating efficient implementation in PostgreSQL.
Findings
Achieves up to 1000x performance improvement over baseline methods.
Minimal overhead with execution times comparable to standard SQL Group-by.
Effective on both TPC-H benchmark data and real-world social check-in data.
Abstract
The SQL group-by operator plays an important role in summarizing and aggregating large datasets in a data analytic stack. While the standard group-by operator, which is based on equality, is useful in several applications, allowing similarity-aware grouping provides a more realistic view of real-world data that could lead to better insights. The Similarity SQL-based Group-By operator (SGB, for short) extends the semantics of the standard SQL Group-by by grouping data with similar but not necessarily equal values. While existing similarity-based grouping operators efficiently materialize this approximate semantics, they primarily focus on one-dimensional attributes and treat multidimensional attributes independently. However, correlated attributes, such as in spatial data, are processed independently, and hence, groups in the multidimensional space are not detected properly. To address…
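To make the idea of similarity-based grouping in a multidimensional space concrete, the following is a minimal Python sketch. It is an illustration only and not the paper's operators or its PostgreSQL implementation: it assumes Euclidean distance, a hypothetical threshold parameter `eps`, and a simple rule that two points share a group whenever a chain of points, each within `eps` of the next, connects them. The function name `similarity_groups` and the sample points are made up for this example.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def similarity_groups(points, eps):
    """Group points so that points linked by a chain of neighbors,
    each within distance eps of the next, end up in the same group.
    Illustrative assumption: this chaining rule is not taken from the paper."""
    parent = list(range(len(points)))  # union-find forest over point indices

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once; merge the two groups when the pair is similar.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= eps:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    # Collect the members of each group by their root index.
    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return sorted(groups.values())


# Hypothetical 2-D data: no two points are equal, so an equality-based
# group-by on (x, y) would return six singleton groups.
points = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0), (5.0, 5.0), (5.2, 4.9), (0.0, 5.0)]
groups = similarity_groups(points, eps=1.0)
for g in groups:
    print(g)
```

With these sample points the sketch yields three groups instead of six, because distance is measured jointly over both coordinates. The pairwise loop is quadratic in the number of points and is meant only to show the semantics, not an efficient evaluation strategy.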
Taxonomy
Topics: Data Management and Algorithms · Advanced Clustering Algorithms Research · Data Mining Algorithms and Applications
