In-stream Probabilistic Cardinality Estimation for Bloom Filters
Remy Scholler, Jean-Francois Couchot, Oumaima Alaoui-Ismaili, Denis Renaud, Eric Ballot

TL;DR
This paper introduces a probabilistic method to improve the accuracy and reduce the variance of cardinality estimation in Bloom filters for streaming data, with practical evaluation on real mobility datasets.
Contribution
It proposes a novel in-stream approach to estimate and optimize the counting error of Bloom filter-based cardinality estimation, significantly reducing variance compared to existing methods.
Findings
Achieves average estimates at least as accurate as those of state-of-the-art methods.
Reduces the variance of cardinality estimates by a factor of approximately 6 to 7.
Validated on real mobility data from mobile network records.
Abstract
The amount of data coming from different sources such as IoT sensors, social networks, and cellular networks has increased exponentially during the last few years. Probabilistic Data Structures (PDS) are efficient alternatives to deterministic data structures, suitable for large-scale data processing and streaming applications. They are mainly used for approximate membership queries, frequency counting, cardinality estimation, and similarity search. Finding the number of distinct elements in a large dataset or in streaming data is an active research area. In this work, we show that the usual Bloom filter-based methods for this kind of cardinality estimation are relatively accurate on average but have a high variance. Reducing this variance is therefore necessary to obtain accurate statistics. We propose a probabilistic approach to estimate more accurately the cardinality of a Bloom filter based…
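For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of the widely used fill-ratio estimator for the number of distinct elements inserted into a Bloom filter: n ≈ -(m/k) · ln(1 - X/m), where m is the number of bits, k the number of hash functions, and X the number of bits set. This is not the in-stream method proposed in the paper. The class name `BloomFilter`, the SHA-256 double-hashing scheme, and the parameter values are assumptions chosen for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter (hypothetical helper, not the paper's code)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m                # number of bits in the filter
        self.k = k                # number of hash functions
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _positions(self, item: str) -> list[int]:
        # Assumption: derive k positions by double hashing two
        # 64-bit halves of a SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p] = 1

    def estimate_cardinality(self) -> float:
        # Fill-ratio estimator: n ~ -(m / k) * ln(1 - X / m)
        x = sum(self.bits)  # number of bits set to 1
        if x >= self.m:
            return math.inf  # saturated filter: the estimate diverges
        return -(self.m / self.k) * math.log(1 - x / self.m)


# A stream of 1000 distinct identifiers, each observed three times.
bf = BloomFilter(m=16384, k=4)
for _ in range(3):
    for i in range(1000):
        bf.add(f"user-{i}")

estimate = bf.estimate_cardinality()
print(f"estimated distinct elements: {estimate:.1f} (true value: 1000)")
```

Repeated insertions of the same identifier set no new bits, so duplicates in the stream do not change the estimate. The estimate depends only on the number of set bits, which fluctuates with hash collisions; this is the source of the variance the paper aims to reduce.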
Taxonomy
Topics: Caching and Content Delivery · Human Mobility and Location-Based Analysis · Internet Traffic Analysis and Secure E-voting
