Estimating Correlation Clustering Cost in Node-Arrival Stream
Kaiwen Liu, Seba Daniela Villalobos, Qin Zhang

TL;DR
This paper introduces a streaming algorithm that approximates the cost of correlation clustering in the node-arrival model, using space sublinear in the number of nodes and a constant number of passes, and evaluates it on real-world datasets.
Contribution
The paper proposes C$^4$Approx, a streaming algorithm that approximates the cost of correlation clustering in node-arrival streams using sublinear space and a constant number of passes, complemented by lower bounds and experiments on real-world datasets.
Findings
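Approximates the correlation clustering cost using space sublinear in the number of nodes and a constant number of passes.
Performs comparably to the classic Pivot algorithm and the more recent PrunedPivot algorithm while storing only 2% of the nodes.
Remains effective on real-world datasets, even on sparse graphs.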
Abstract
We study the correlation clustering problem in the node-arrival data stream model. Unlike previous work, where the stream consists of the graph's edges, we focus on the setting in which the stream contains only the nodes. This model better reflects many real-world scenarios in which the data stream naturally consists of raw objects (e.g., images, tweets), and the similar/dissimilar edges are derived through a similarity function. We present C$^4$Approx, a streaming algorithm that approximates the cost of correlation clustering using sublinear space in the number of nodes and a constant number of passes. We further complement this result with lower bounds. Experiments on real-world datasets show that by storing only 2% of the nodes, our algorithm achieves performance comparable to the classic Pivot algorithm and the more recent PrunedPivot algorithm, even on sparse graphs.
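To make the quantities in the abstract concrete, here is a minimal Python sketch of the classic Pivot baseline and of the correlation clustering cost (the number of disagreements). It is not the paper's streaming algorithm: it stores every node and queries all pairs, which is exactly what a sublinear-space method avoids. The names `pivot_clustering`, `clustering_cost`, and the toy `similar` function are assumptions made for illustration; the only thing taken from the source is that edges are derived from a similarity function over raw objects.

```python
import random
from itertools import combinations


def pivot_clustering(nodes, similar, rng):
    """Classic Pivot: visit nodes in random order; each unclustered node
    becomes a pivot and absorbs every unclustered node similar to it."""
    order = list(nodes)
    rng.shuffle(order)
    cluster_of = {}  # node -> pivot that represents its cluster
    for pivot in order:
        if pivot in cluster_of:
            continue
        cluster_of[pivot] = pivot
        for v in order:
            if v not in cluster_of and similar(pivot, v):
                cluster_of[v] = pivot
    return cluster_of


def clustering_cost(nodes, similar, cluster_of):
    """Disagreements: similar pairs that are split across clusters plus
    dissimilar pairs that share a cluster."""
    cost = 0
    for u, v in combinations(nodes, 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        if similar(u, v) != same_cluster:
            cost += 1
    return cost


# Hypothetical toy data: points on a line, "similar" if within distance 1.
points = [0.0, 0.5, 1.0, 10.0, 10.5]
similar = lambda a, b: abs(a - b) <= 1.0

clusters = pivot_clustering(points, similar, random.Random(0))
print(clustering_cost(points, similar, clusters))  # prints 0
```

In this toy input the similar pairs form two disjoint cliques, so Pivot recovers them exactly and the cost is 0 for any random order. On general inputs the cost depends on the pivot order.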
