Stream Clustering using Probabilistic Data Structures
Andrei Sorin Sabau

TL;DR
This paper introduces a stream clustering algorithm that uses probabilistic data structures, namely a count-min sketch and Bloom filters, to detect arbitrarily shaped clusters efficiently and with probabilistic accuracy guarantees.
Contribution
It proposes a unified sketch-based approach for online density estimation and cluster membership, replacing traditional two-phase clustering schemes.
Findings
Effective detection of arbitrarily shaped clusters
Handles outliers robustly
Demonstrates high efficiency on real and synthetic datasets
Abstract
Most density-based stream clustering algorithms separate the clustering process into an online and an offline component. Exact summary statistics are employed to define micro-clusters or grid cells during the online stage, followed by macro-clustering during the offline stage. This paper proposes a novel alternative to the traditional two-phase stream clustering scheme, introducing sketch-based data structures for assessing both stream density and cluster membership with probabilistic accuracy guarantees. A count-min sketch using a damped window model estimates stream density. Bloom filters employing a variation of active-active buffering estimate cluster membership. Instances of both types of sketches share the same set of hash functions. The resulting stream clustering algorithm is capable of detecting arbitrarily shaped clusters while correctly handling outliers and making no…
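The two sketch types named in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the grid-cell keying (`cell_of`), the exponential fading rule `2 ** (-decay * dt)`, the BLAKE2b-based hash family, and all class and parameter names are assumptions chosen for illustration, and the active-active buffering of the Bloom filters is not modelled. The only points taken from the abstract are that a count-min sketch estimates density under a damped window, that Bloom filters record cluster membership, and that both share one set of hash functions.

```python
import hashlib
import math


def make_hashes(depth, width):
    """Build `depth` hash functions mapping a key to [0, width) (assumed hash family)."""
    def make(seed):
        salt = seed.to_bytes(8, "little")
        def h(key):
            digest = hashlib.blake2b(repr(key).encode(), digest_size=8, salt=salt).digest()
            return int.from_bytes(digest, "little") % width
        return h
    return [make(seed) for seed in range(depth)]


def cell_of(point, size):
    """Hypothetical keying: quantize a point to the grid cell that contains it."""
    return tuple(math.floor(x / size) for x in point)


class DampedCountMin:
    """Count-min sketch whose counters fade exponentially with elapsed time."""

    def __init__(self, hashes, width, decay=0.01):
        self.hashes = hashes
        self.rows = [[0.0] * width for _ in hashes]
        self.decay = decay      # assumed damping rate
        self.last_t = 0.0

    def _fade(self, t):
        # Naive O(depth * width) fading of every counter; kept simple on purpose.
        factor = 2.0 ** (-self.decay * (t - self.last_t))
        if factor != 1.0:
            for row in self.rows:
                for i in range(len(row)):
                    row[i] *= factor
        self.last_t = t

    def add(self, key, t):
        self._fade(t)
        for row, h in zip(self.rows, self.hashes):
            row[h(key)] += 1.0

    def estimate(self, key, t):
        # The minimum over rows never underestimates the faded count.
        self._fade(t)
        return min(row[h(key)] for row, h in zip(self.rows, self.hashes))


class BloomFilter:
    """Set-membership sketch: false positives possible, false negatives not."""

    def __init__(self, hashes, width):
        self.hashes = hashes
        self.bits = [False] * width

    def add(self, key):
        for h in self.hashes:
            self.bits[h(key)] = True

    def __contains__(self, key):
        return all(self.bits[h(key)] for h in self.hashes)


# Both sketches share the same hash functions, as the abstract describes.
WIDTH = 256
shared = make_hashes(depth=4, width=WIDTH)
density = DampedCountMin(shared, WIDTH, decay=1.0)
members = BloomFilter(shared, WIDTH)

# Feed a few points from one dense region; mark the cell once it looks dense.
for t, point in enumerate([(1.2, 2.7), (1.4, 2.1), (1.9, 2.5)]):
    cell = cell_of(point, size=1.0)
    density.add(cell, float(t))
    if density.estimate(cell, float(t)) >= 1.5:   # assumed density threshold
        members.add(cell)
```

Because the hash functions are shared, the bucket indices computed for a cell could be reused across both structures in a tighter implementation; here each structure recomputes them for clarity.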
Taxonomy
Topics: Data Stream Mining Techniques · Advanced Clustering Algorithms Research · Data Management and Algorithms
