Optimal parameters for bloom-filtered joins in Spark
Ophir Lojkine

TL;DR
This paper introduces an algorithm that determines the optimal size of Bloom filters for distributed join operations in Spark, outperforming both previous Bloom filter methods and the default SparkSQL engine.
Contribution
It develops a mathematical model to find optimal Bloom filter parameters, enhancing join efficiency in distributed databases beyond existing fixed-size approaches.
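For context, the sketch below shows the classical textbook rule for sizing a Bloom filter from a target false-positive rate. It is not the cost model developed in the paper, which optimizes the filter size against the cost of the whole join. The function names `classic_bloom_parameters` and `expected_fpr` are hypothetical and chosen only for this illustration.

```python
import math


def classic_bloom_parameters(n_items, target_fpr):
    """Textbook sizing: return (bits, hash_count) for n_items at target_fpr.

    Assumption: this is the standard formula m = -n ln(p) / (ln 2)^2,
    k = (m / n) ln 2, not the join-aware model from the paper.
    """
    if n_items <= 0 or not (0.0 < target_fpr < 1.0):
        raise ValueError("n_items must be positive and target_fpr in (0, 1)")
    m_bits = math.ceil(-n_items * math.log(target_fpr) / (math.log(2) ** 2))
    k_hashes = max(1, round(m_bits / n_items * math.log(2)))
    return m_bits, k_hashes


def expected_fpr(n_items, m_bits, k_hashes):
    """Standard approximation of the false-positive rate: (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


if __name__ == "__main__":
    m, k = classic_bloom_parameters(1_000_000, 0.01)
    print(m, k, expected_fpr(1_000_000, m, k))
```

A fixed target false-positive rate ignores how expensive each false positive is for the join. Accounting for that cost when choosing the size is the kind of trade-off a join-level model can capture.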
Findings
Optimal Bloom filter sizing improves join performance
Algorithm outperforms previous Bloom filter methods
Enhanced efficiency on TPC-H benchmark in Spark
Abstract
In this paper, we present an algorithm that efficiently joins relational database tables in a distributed environment using Bloom filters of optimal size. Rather than using fixed-size Bloom filters as in previous research, we build a mathematical model of the join algorithm and then find the optimal filter parameters using traditional mathematical optimization. With these optimal parameters, the algorithm beats both previous Bloom filter approaches and the default SparkSQL engine, not only on star joins but also on traditional database schemas. The experiments were conducted on a standard TPC-H database stored as Parquet files on a distributed file system.
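As a rough single-process illustration of the general idea behind a Bloom-filtered join, the sketch below builds a filter on the keys of the smaller table, uses it to discard rows of the larger table that cannot match, and then performs an exact hash join on the survivors. It is a simplified sketch, not the paper's Spark implementation: the `BloomFilter` class, the `bloom_filtered_join` function, and the double-hashing scheme over SHA-256 are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def bloom_filtered_join(small, large, m_bits, k_hashes):
    """Join two lists of (key, value) rows, pre-filtering `large` with a Bloom filter."""
    bloom = BloomFilter(m_bits, k_hashes)
    lookup = {}
    for key, value in small:
        bloom.add(key)
        lookup.setdefault(key, []).append(value)

    # Cheap pre-filter: rows rejected here can never match the small table.
    survivors = [(key, value) for key, value in large if bloom.might_contain(key)]

    # Exact join on the survivors removes any false positives.
    return [
        (key, small_value, large_value)
        for key, large_value in survivors
        for small_value in lookup.get(key, [])
    ]


if __name__ == "__main__":
    small_table = [(1, "a"), (2, "b")]
    large_table = [(i, i * 10) for i in range(100)]
    print(bloom_filtered_join(small_table, large_table, m_bits=64, k_hashes=3))
```

In a distributed setting the pre-filter step is where the savings come from, since rows dropped by the filter do not need to be shuffled across the network. A larger filter drops more rows but costs more to build and distribute, which is why its size is worth optimizing.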
Taxonomy
Topics: Caching and Content Delivery · Network Security and Intrusion Detection · Network Packet Processing and Optimization
