SWOOP: Top-k Similarity Joins over Set Streams
Willi Mann, Nikolaus Augsten, Christian S. Jensen

TL;DR
SWOOP is a scalable algorithm designed to efficiently maintain the top-$k$ most similar set pairs in rapid data streams, such as Twitter, by using novel indexing and filtering techniques to handle dynamic updates.
Contribution
The paper introduces SWOOP, a new stream join algorithm that efficiently maintains top-$k$ similar set pairs in high-speed data streams with minimal memory usage.
Findings
SWOOP outperforms existing methods by handling much faster stream rates.
It uses novel indexing and filtering to prune useless pairs efficiently.
SWOOP maintains a minimal stock of similar pairs for quick updates.
Abstract
We provide efficient support for applications that aim to continuously find pairs of similar sets in rapid streams of sets. A prototypical example setting is that of tweets. A tweet is a set of words, and Twitter emits about half a billion tweets per day. Our solution makes it possible to efficiently maintain the top-$k$ most similar tweets from a pair of rapid Twitter streams, e.g., to discover similar trends in two cities if the streams concern cities. Using a sliding window model, the top-$k$ result changes as new sets in the stream enter the window or existing ones leave the window. Maintaining the top-$k$ result under rapid streams is challenging. First, when a set arrives, it may form a new pair for the top-$k$ result with any set already in the window. Second, when a set leaves the window, all its pairings in the top-$k$ are invalidated and must be replaced. It is not enough to…
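To make the problem setting concrete, here is a minimal brute-force baseline for a top-$k$ join over two windowed set streams. It is not SWOOP and uses none of its indexing or filtering. It recomputes the result from every pair in the window, which is exactly the cost a stream join algorithm has to avoid. The class name `NaiveTopKJoin`, the stream labels `"R"` and `"S"`, the time-based window, the use of Jaccard similarity, and the assumption that timestamps arrive in nondecreasing order are all illustrative choices, not details taken from the abstract.

```python
import heapq
from collections import deque
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two sets (an assumed similarity function)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class NaiveTopKJoin:
    """Hypothetical brute-force baseline, NOT SWOOP."""

    def __init__(self, k, window):
        self.k = k
        self.window = window   # window length in time units (assumption)
        self.left = deque()    # (timestamp, frozenset) entries from stream R
        self.right = deque()   # (timestamp, frozenset) entries from stream S

    def _expire(self, now):
        # Drop sets that have left the sliding window.
        # Assumes timestamps arrive in nondecreasing order.
        for buf in (self.left, self.right):
            while buf and buf[0][0] <= now - self.window:
                buf.popleft()

    def insert(self, side, timestamp, items):
        # An arriving set may pair with any set in the other stream's window.
        self._expire(timestamp)
        buf = self.left if side == "R" else self.right
        buf.append((timestamp, frozenset(items)))

    def top_k(self):
        # Recompute from scratch: cost grows with |R window| * |S window|.
        pairs = (
            (jaccard(r, s), tr, ts)
            for (tr, r), (ts, s) in product(self.left, self.right)
        )
        return heapq.nlargest(self.k, pairs)
```

Each result entry is a `(similarity, R timestamp, S timestamp)` tuple. When a set expires, every pair it took part in disappears from the next `top_k()` call, so the replacement pairs are found only by rescanning the whole window.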
Taxonomy
Topics: Data Management and Algorithms · Data Quality and Management · Advanced Database Systems and Queries
