Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams
Suman K. Bera, Sourav Dutta, Ankur Narang, Souvik Bhattacherjee

TL;DR
This paper introduces novel Bloom Filter algorithms, including RSBF and BSBF variants, for efficient approximate duplicate detection in streaming data, with theoretical bounds and empirical validation on large datasets.
Contribution
It presents new Bloom Filter-based algorithms with theoretical analysis and demonstrates their superior performance over existing methods in large-scale data streams.
Findings
Proposed algorithms outperform existing duplicate detection methods.
Theoretical bounds on false positive, false negative, and convergence rates are established.
Empirical results on real and synthetic datasets validate the effectiveness of the models.
Abstract
Applications involving telecommunication call data records, web pages, online transactions, medical records, stock markets, climate warning systems, etc., necessitate efficient management and processing of massive, exponentially growing amounts of data from diverse sources. De-duplication or Intelligent Compression in streaming scenarios, for approximate identification and elimination of duplicates from such unbounded data streams, is an even greater challenge given the real-time nature of data arrival. Stable Bloom Filters (SBF) address this problem to a certain extent. In this work, we present several novel algorithms for the problem of approximate detection of duplicates in data streams. We propose the Reservoir Sampling based Bloom Filter (RSBF) combining the working principle of reservoir sampling and Bloom Filters. We also present variants of the novel Biased Sampling based Bloom Filter…
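To make the idea of combining reservoir sampling with Bloom Filters concrete, here is a minimal sketch in Python. It is not the authors' RSBF: the class name, the choice of threshold, the SHA-256 based hashing, and the rule of clearing one random bit per filter on each late insertion are all assumptions made for illustration. The abstract only states that RSBF combines the two principles.

```python
import hashlib
import random


class ReservoirStyleBloomFilter:
    """Illustrative sketch only (hypothetical name, not the paper's RSBF)."""

    def __init__(self, total_bits, num_hashes, seed=0):
        self.k = num_hashes
        self.bits_per_filter = total_bits // num_hashes
        # One byte per bit keeps the sketch simple; a real filter would pack bits.
        self.filters = [bytearray(self.bits_per_filter) for _ in range(self.k)]
        # Assumption: the reservoir threshold equals the size of one filter.
        self.threshold = self.bits_per_filter
        self.seen = 0
        self.rng = random.Random(seed)

    def _positions(self, item):
        # One position per filter, derived from salted SHA-256 digests.
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits_per_filter

    def is_duplicate(self, item):
        """Report whether item looks like a duplicate, then possibly insert it."""
        self.seen += 1
        positions = list(self._positions(item))
        duplicate = all(f[p] for f, p in zip(self.filters, positions))
        if not duplicate:
            # Reservoir-style rule: the i-th element is kept with
            # probability min(1, threshold / i).
            if self.rng.random() < min(1.0, self.threshold / self.seen):
                for f, p in zip(self.filters, positions):
                    if self.seen > self.threshold:
                        # Assumption: clear one random bit per filter so the
                        # filters do not saturate on an unbounded stream.
                        f[self.rng.randrange(self.bits_per_filter)] = 0
                    f[p] = 1
        return duplicate


if __name__ == "__main__":
    rsbf = ReservoirStyleBloomFilter(total_bits=1024, num_hashes=4)
    for record in ["call-1", "call-2", "call-1"]:
        print(record, rsbf.is_duplicate(record))
```

Clearing bits is what introduces false negatives, and skipping insertions late in the stream adds more of them. That is the kind of trade-off against false positives that the paper's theoretical bounds address.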
Taxonomy
Topics: Caching and Content Delivery · Data Quality and Management · Privacy-Preserving Technologies in Data
