Fishing in the Stream: Similarity Search over Endless Data

Naama Kraus; David Carmel; Idit Keidar

arXiv:1708.02062 · cs.IR · August 8, 2017

TL;DR

This paper introduces Stream-LSH, a randomized algorithm for similarity search over endless data streams. It accounts for data quality and temporal characteristics in addition to similarity, and it keeps the index size bounded even though the stream itself is unbounded.

Contribution

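The paper proposes Stream-LSH, a randomized algorithm that bounds index size by retaining items according to their freshness, quality, and dynamic popularity. The authors show analytically that this increases the probability of finding similar items compared to alternative approaches using the same space capacity.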

Findings

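01 Analytically, Stream-LSH increases the probability of finding similar items compared to alternative approaches using the same space capacity.

02 An empirical study on real-world stream datasets confirms the theoretical results.

03 By bounding the index size, the approach processes unbounded data with bounded computation resources.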

Abstract

Similarity search is the task of retrieving data items that are similar to a given query. In this paper, we introduce the time-sensitive notion of similarity search over endless data streams (SSDS), which takes into account data quality and temporal characteristics in addition to similarity. SSDS is challenging as it needs to process unbounded data, while computation resources are bounded. We propose Stream-LSH, a randomized SSDS algorithm that bounds the index size by retaining items according to their freshness, quality, and dynamic popularity attributes. We analytically show that Stream-LSH increases the probability of finding similar items compared to alternative approaches using the same space capacity. We further conduct an empirical study using real-world stream datasets, which confirms our theoretical results.
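To make the idea of a bounded index over an unbounded stream concrete, here is a minimal Python sketch. It is not the paper's Stream-LSH: the abstract does not specify the retention policy, so everything below is an assumption chosen for illustration. The sketch uses sign-of-random-projection hashing, a fixed per-bucket capacity, and a deterministic eviction rule that scores each item as quality times an exponential freshness decay, whereas the paper describes a randomized algorithm and also accounts for dynamic popularity. The names `BoundedStreamIndex`, `bucket_capacity`, and `decay` are hypothetical.

```python
import math
import random
from collections import defaultdict


class BoundedStreamIndex:
    """Toy LSH index with a fixed per-bucket capacity (illustrative assumption,
    not the Stream-LSH algorithm from the paper)."""

    def __init__(self, dim, num_bits=4, bucket_capacity=2, decay=0.1, seed=0):
        rng = random.Random(seed)
        # Random hyperplanes: the sign pattern of the projections is the bucket key.
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]
        self.bucket_capacity = bucket_capacity
        self.decay = decay
        # bucket key -> list of (item_id, quality, arrival_time)
        self.buckets = defaultdict(list)

    def _key(self, vec):
        return tuple(
            int(sum(p * v for p, v in zip(plane, vec)) >= 0) for plane in self.planes
        )

    def _score(self, quality, arrival, now):
        # Assumed retention score: quality discounted by age.
        return quality * math.exp(-self.decay * (now - arrival))

    def insert(self, item_id, vec, quality, now):
        bucket = self.buckets[self._key(vec)]
        bucket.append((item_id, quality, now))
        if len(bucket) > self.bucket_capacity:
            # Evict the entry with the lowest score so the bucket stays bounded.
            worst = min(range(len(bucket)),
                        key=lambda i: self._score(bucket[i][1], bucket[i][2], now))
            del bucket[worst]

    def query(self, vec):
        # Return ids of retained items that hash to the same bucket as the query.
        return [entry[0] for entry in self.buckets.get(self._key(vec), [])]


idx = BoundedStreamIndex(dim=2, bucket_capacity=2, decay=0.1, seed=0)
idx.insert("a", [1.0, 0.0], quality=1.0, now=0)
idx.insert("b", [1.0, 0.0], quality=0.2, now=1)
idx.insert("c", [1.0, 0.0], quality=1.0, now=2)  # bucket is full: "b" has the lowest score
print(idx.query([1.0, 0.0]))  # ['a', 'c']
```

In this sketch the total index size can never exceed `bucket_capacity` times the number of buckets (at most `2 ** num_bits`), no matter how many items arrive. The low-quality item "b" is dropped even though it is newer than "a", which illustrates how freshness and quality can trade off under a fixed space budget.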
