Towards "Intelligent Compression" in Streams: A Biased Reservoir Sampling based Bloom Filter Approach
Sourav Dutta, Souvik Bhattacherjee, Ankur Narang

TL;DR
This paper introduces a Reservoir Sampling based Bloom Filter (RSBF) that improves duplicate detection in data streams by reducing false negatives and convergence time, outperforming Stable Bloom Filters while maintaining low memory usage.
Contribution
It presents the first integration of reservoir sampling with Bloom filters for streaming deduplication, providing theoretical bounds and empirical evidence of superior performance.
Findings
Up to 2x reduction in false negative rate (FNR) compared to Stable Bloom Filters (SBF)
Better convergence rates with similar false positive rates
Low memory requirements for large-scale data streams
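The false positive and false negative rates named in these findings can be made concrete with a small measurement harness. The sketch below is illustrative only and is not taken from the paper: the function names `dedup_error_rates` and `make_exact_filter` are hypothetical, and it assumes the usual definitions, where a false positive is a distinct element reported as a duplicate and a false negative is a duplicate reported as distinct.

```python
from typing import Callable, Hashable, Iterable, Tuple


def dedup_error_rates(
    stream: Iterable[Hashable],
    is_duplicate: Callable[[Hashable], bool],
) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) for a dedup filter.

    `is_duplicate(x)` is assumed to both answer the query and record x,
    as a streaming filter would. Ground truth is tracked with an exact set,
    which is only feasible for small test streams.
    """
    seen = set()
    distinct = duplicates = false_pos = false_neg = 0
    for x in stream:
        truly_duplicate = x in seen
        reported = is_duplicate(x)
        if truly_duplicate:
            duplicates += 1
            if not reported:
                false_neg += 1  # duplicate missed by the filter
        else:
            distinct += 1
            if reported:
                false_pos += 1  # distinct element wrongly flagged
        seen.add(x)
    fpr = false_pos / distinct if distinct else 0.0
    fnr = false_neg / duplicates if duplicates else 0.0
    return fpr, fnr


def make_exact_filter() -> Callable[[Hashable], bool]:
    """Reference filter with unbounded memory: never errs."""
    seen = set()

    def is_duplicate(x: Hashable) -> bool:
        dup = x in seen
        seen.add(x)
        return dup

    return is_duplicate


if __name__ == "__main__":
    print(dedup_error_rates([1, 2, 1, 3, 2], make_exact_filter()))
```

Any bounded-memory filter can be passed in place of the exact filter to compare its error rates on the same stream.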
Abstract
With the explosion of information stored world-wide, data intensive computing has become a central area of research. Efficient management and processing of this massively exponential amount of data from diverse sources, such as telecommunication call data records, online transaction records, etc., has become a necessity. Removing redundancy from such huge (multi-billion records) datasets resulting in resource and compute efficiency for downstream processing constitutes an important area of study. "Intelligent compression" or deduplication in streaming scenarios, for precise identification and elimination of duplicates from the unbounded datastream is a greater challenge given the realtime nature of data arrival. Stable Bloom Filters (SBF) address this problem to a certain extent. However, SBF suffers from a high false negative rate (FNR) and slow convergence rate, thereby rendering it inefficient for…
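To give a feel for how reservoir sampling might be combined with a Bloom filter for stream deduplication, here is a deliberately simplified sketch. It is not the paper's RSBF algorithm, whose details are not given on this page. The class name `ReservoirBloomSketch`, the parameters, the eviction rule (clearing one random bit per sub-filter), and the choice of reservoir size are all assumptions made for illustration. The only borrowed idea is the classic reservoir-sampling rule: the first s elements are always inserted, and the i-th element thereafter is inserted with probability s/i.

```python
import hashlib
import random


class ReservoirBloomSketch:
    """Illustrative only: k sub-filters with reservoir-style biased insertion."""

    def __init__(self, num_bits=4096, num_hashes=4, reservoir_size=None, seed=0):
        self.k = num_hashes
        self.m = num_bits // num_hashes  # bits per sub-filter
        # One byte per bit keeps the sketch readable; a real filter would pack bits.
        self.filters = [bytearray(self.m) for _ in range(self.k)]
        # Assumption: the reservoir size defaults to the sub-filter size.
        self.s = reservoir_size if reservoir_size is not None else self.m
        self.count = 0  # number of stream elements seen so far
        self.rng = random.Random(seed)

    def _positions(self, item):
        """One bit position per sub-filter, from salted BLAKE2b digests."""
        data = repr(item).encode("utf-8")
        positions = []
        for i in range(self.k):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=i.to_bytes(2, "little")
            ).digest()
            positions.append(int.from_bytes(digest, "little") % self.m)
        return positions

    def check_and_insert(self, item):
        """Return True if item is reported as a duplicate, else maybe insert it."""
        self.count += 1
        positions = self._positions(item)
        if all(f[p] for f, p in zip(self.filters, positions)):
            return True  # all k bits set: reported duplicate (may be a false positive)
        past_warmup = self.count > self.s
        # Reservoir rule: always insert during warm-up, then with probability s/i.
        if not past_warmup or self.rng.random() < self.s / self.count:
            if past_warmup:
                # Assumed eviction step: clear one random bit in each sub-filter
                # so the filter does not saturate on an unbounded stream.
                for f in self.filters:
                    f[self.rng.randrange(self.m)] = 0
            for f, p in zip(self.filters, positions):
                f[p] = 1
        return False


if __name__ == "__main__":
    rbf = ReservoirBloomSketch()
    print(rbf.check_and_insert("call-record-1"))  # first sighting
    print(rbf.check_and_insert("call-record-1"))  # immediate repeat
```

In this sketch, clearing bits on insertion is what introduces false negatives, while skipped insertions (probability 1 - s/i) trade recall on recent elements for stability of the filter contents. How the actual RSBF balances these effects is described in the paper itself.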
Taxonomy
Topics: Caching and Content Delivery · Advanced Data Storage Technologies · Advanced Steganography and Watermarking Techniques
