Optimal parameters for bloom-filtered joins in Spark

Ophir Lojkine

arXiv:1706.02785 · cs.DC · June 13, 2017 · 1 citation

TL;DR

This paper presents an algorithm that computes the optimal Bloom filter size for distributed joins in Spark. With these parameters, the join outperforms both earlier fixed-size Bloom filter approaches and the default SparkSQL engine on the TPC-H benchmark.

Contribution

The paper builds a mathematical model of the Bloom-filtered join algorithm and optimizes it to find the filter parameters, instead of relying on the fixed-size filters used in earlier work.
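To make the idea of optimizing a cost model concrete, here is a minimal sketch in Python. It is not the paper's model, which this page does not reproduce. The only standard ingredient is the textbook false-positive approximation for a Bloom filter with the optimal number of hash functions, p = exp(-(m/n)(ln 2)^2). Everything else is an assumption made for illustration: the two-term cost function, and the names `toy_join_cost`, `optimal_filter_bits`, `cost_per_bit`, and `cost_per_row`.

```python
import math

LN2_SQUARED = math.log(2) ** 2


def false_positive_rate(m_bits: float, n_keys: int) -> float:
    """Textbook approximation for a Bloom filter of m_bits bits holding
    n_keys keys, assuming the optimal hash count k = (m / n) * ln 2."""
    return math.exp(-(m_bits / n_keys) * LN2_SQUARED)


def toy_join_cost(m_bits, n_keys, n_nonmatching_rows, cost_per_bit, cost_per_row):
    """Hypothetical cost model (an assumption, not taken from the paper):
    cost of distributing the filter plus cost of shuffling the
    non-matching rows that slip through as false positives."""
    filter_cost = cost_per_bit * m_bits
    leak_cost = cost_per_row * n_nonmatching_rows * false_positive_rate(m_bits, n_keys)
    return filter_cost + leak_cost


def optimal_filter_bits(n_keys, n_nonmatching_rows, cost_per_bit, cost_per_row):
    """Closed-form minimizer of toy_join_cost with respect to m_bits.

    Setting the derivative to zero gives
        m = (n / c) * ln(b * N * c / (a * n)),  with c = (ln 2)^2,
    where a = cost_per_bit, b = cost_per_row, N = n_nonmatching_rows.
    If the logarithm's argument is <= 1, the cost only grows with m,
    so the best choice under this model is no filter at all (m = 0).
    """
    ratio = (cost_per_row * n_nonmatching_rows * LN2_SQUARED) / (cost_per_bit * n_keys)
    if ratio <= 1.0:
        return 0.0
    return (n_keys / LN2_SQUARED) * math.log(ratio)


if __name__ == "__main__":
    # Illustrative numbers only: 1M keys in the small table, 100M
    # non-matching rows in the large one, made-up unit costs.
    m = optimal_filter_bits(1_000_000, 100_000_000, 1e-9, 1e-6)
    print(f"bits per key: {m / 1_000_000:.1f}")
    print(f"false-positive rate: {false_positive_rate(m, 1_000_000):.2e}")
```

The point of the sketch is the trade-off: a larger filter costs more to build and distribute but lets fewer useless rows through, so the best size depends on the table sizes and on the relative costs rather than being a fixed constant.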

Findings

01. Optimal Bloom filter sizing improves join performance

02. Algorithm outperforms previous Bloom filter methods

03. Enhanced efficiency on TPC-H benchmark in Spark

Abstract

In this paper, we present an algorithm that joins relational database tables efficiently in a distributed environment using Bloom filters of an optimal size. Rather than using fixed-size Bloom filters as in previous research, we propose finding an optimal size for the Bloom filters by creating a mathematical model of the join algorithm and then finding the optimal parameters using traditional mathematical optimization. With optimal parameters, this algorithm beats both previous approaches using Bloom filters and the default SparkSQL engine, not only on star joins but also on traditional database schemas. The experiments were conducted on a standard TPC-H database stored as Parquet files on a distributed file system.
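For readers unfamiliar with the technique, a Bloom-filtered join can be sketched in a single Python process as follows. This is an illustration of the general idea only, not the paper's Spark implementation: the `BloomFilter` class, the `bloom_filtered_join` function, the double-hashing scheme built on SHA-256, and the `target_fpr` parameter are all assumptions introduced here. The filter is sized with the standard formulas m = -n ln p / (ln 2)^2 and k = (m/n) ln 2.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter sized from an expected key count and a target
    false-positive rate (hypothetical helper, for illustration only)."""

    def __init__(self, expected_keys: int, target_fpr: float):
        n = max(1, expected_keys)
        self.m = max(1, math.ceil(-n * math.log(target_fpr) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key):
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(key))


def bloom_filtered_join(small, large, target_fpr=0.01):
    """Inner-join two lists of (key, value) pairs.

    Rows of `large` whose key is definitely absent from `small` are
    dropped by the filter before the join; in a distributed setting
    these are the rows that would no longer need to be shuffled.
    Returns the joined rows and the number of rows that passed the filter.
    """
    bloom = BloomFilter(len(small), target_fpr)
    for key, _ in small:
        bloom.add(key)

    survivors = [(key, value) for key, value in large if key in bloom]

    # Exact hash join on the survivors removes any false positives.
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    joined = [(key, sv, lv) for key, lv in survivors for sv in lookup.get(key, [])]
    return joined, len(survivors)
```

A Bloom filter never reports a present key as absent, so the join result is exact; the false-positive rate only affects how many unnecessary rows survive the pre-filter, which is why the filter size is a tuning parameter rather than a correctness concern.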

Taxonomy

Topics: Caching and Content Delivery · Network Security and Intrusion Detection · Network Packet Processing and Optimization