Streaming Similarity Self-Join
Gianmarco De Francisci Morales, Aristides Gionis

TL;DR
This paper addresses the challenge of computing similarity self-joins in streaming data by introducing time-dependent similarity and developing scalable algorithms, notably the STR framework with the L2 index, for real-time similarity detection.
Contribution
It proposes a novel streaming similarity self-join framework using time-dependent similarity, along with new algorithms and an optimized index for scalable real-time processing.
Findings
The STR algorithm with the L2 index outperforms the other methods in scalability
Time-dependent similarity effectively manages unbounded streaming data
Extensive experiments validate the efficiency and scalability of the proposed approach
Abstract
We introduce and study the problem of computing the similarity self-join in a streaming context (SSSJ), where the input is an unbounded stream of items arriving continuously. The goal is to find all pairs of items in the stream whose similarity is greater than a given threshold. The simplest formulation of the problem requires unbounded memory, and thus, it is intractable. To make the problem feasible, we introduce the notion of time-dependent similarity: the similarity of two items decreases with the difference in their arrival time. By leveraging the properties of this time-dependent similarity function, we design two algorithmic frameworks to solve the SSSJ problem. The first one, MiniBatch (MB), uses existing index-based filtering techniques for the static version of the problem, and combines them in a pipeline. The second framework, Streaming (STR), adds time filtering to the…
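To illustrate why a time-dependent similarity makes the problem tractable, here is a minimal sketch of a brute-force streaming self-join. It is not the MB or STR framework and does not use the L2 index. It rests on assumptions the text above does not state: the decay is exponential, exp(-lam * |t_x - t_y|), items are unit-normalized sparse vectors compared by dot product, and timestamps arrive in non-decreasing order. Under those assumptions, any two items more than tau = ln(1/theta) / lam apart in time cannot exceed the threshold theta, so older items can be dropped and memory stays bounded. All names (`decayed_similarity`, `time_horizon`, `naive_streaming_self_join`, `theta`, `lam`) are hypothetical.

```python
import math
from collections import deque


def decayed_similarity(x, y, tx, ty, lam):
    """Dot product of two sparse vectors (dicts), damped by exp(-lam * |tx - ty|).

    Assumption: exponential decay; the source only says similarity
    decreases with the difference in arrival time.
    """
    if len(x) > len(y):
        x, y = y, x  # iterate over the shorter vector
    dot = sum(w * y.get(dim, 0.0) for dim, w in x.items())
    return dot * math.exp(-lam * abs(tx - ty))


def time_horizon(theta, lam):
    """Largest time gap at which two unit vectors could still reach theta."""
    return math.log(1.0 / theta) / lam


def naive_streaming_self_join(stream, theta, lam):
    """Yield (older_id, newer_id, similarity) for pairs with similarity > theta.

    `stream` yields (timestamp, item_id, sparse_vector) tuples with
    non-decreasing timestamps and unit-normalized vectors (assumptions).
    """
    tau = time_horizon(theta, lam)
    window = deque()  # items still within the time horizon, oldest first
    for t, item_id, vec in stream:
        # Evict items too old to ever match a current or future arrival.
        while window and t - window[0][0] > tau:
            window.popleft()
        # Compare the new item with every surviving item (no index here).
        for t_old, old_id, old_vec in window:
            s = decayed_similarity(vec, old_vec, t, t_old, lam)
            if s > theta:
                yield (old_id, item_id, s)
        window.append((t, item_id, vec))


if __name__ == "__main__":
    items = [
        (0.0, "a", {"x": 1.0}),
        (1.0, "b", {"x": 1.0}),
        (100.0, "c", {"x": 1.0}),  # identical content, but far too late
    ]
    for pair in naive_streaming_self_join(items, theta=0.5, lam=0.1):
        print(pair)
```

In this example only the pair (a, b) is reported: item c has the same content, but by the time it arrives both a and b have fallen outside the time horizon and have already been evicted. The comparison loop is quadratic in the window size, which is the cost that index-based filtering is meant to reduce.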
Taxonomy
Topics: Data Management and Algorithms · Spam and Phishing Detection · Sentiment Analysis and Opinion Mining
