On Finding Similar Items in a Stream of Transactions
Andrea Campagna, Rasmus Pagh

TL;DR
This paper studies the problem of finding similar item pairs in transaction data streams. It shows that, when transactions arrive in random order, similarity mining is feasible in small space, with accuracy that improves as the stream grows.
Contribution
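The paper presents a first attempt at mining similar pairs, rather than frequent itemsets, in transaction data streams. It pairs a space lower bound for frequent itemset mining with a positive result: under the assumption that transactions arrive in random order, mining accuracy improves with the length of the stream.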
Findings
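Under the random order model, small-space similarity mining is possible for the most common similarity measures.
Mining accuracy improves as the stream length increases.
A lower bound shows that any algorithm finding the most frequent k-itemset, even approximately and with a chance of error, must use Omega(min{mb,n^k,(mb/phi)^k}) bits of space, which is why worst-case streams admit no non-trivial space upper bound.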
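To make "similar pairs" concrete, here is a minimal sketch of an exact one-pass baseline that scores every co-occurring item pair. It is not the paper's algorithm: the function name `jaccard_pairs`, the toy stream, and the choice of Jaccard similarity as the "standard" measure are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def jaccard_pairs(transactions):
    """Exact baseline (illustrative, not the paper's method).

    Returns the Jaccard similarity of every item pair that co-occurs
    in at least one transaction of the stream.
    """
    item_count = Counter()  # transactions containing each item
    pair_count = Counter()  # transactions containing both items of a pair
    for t in transactions:
        items = sorted(set(t))  # dedupe and fix an order for pair keys
        item_count.update(items)
        pair_count.update(combinations(items, 2))
    sims = {}
    for (a, b), both in pair_count.items():
        # |A and B| / |A or B| over the sets of transactions holding a, b
        sims[(a, b)] = both / (item_count[a] + item_count[b] - both)
    return sims


# Hypothetical toy stream of four transactions.
stream = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]
sims = jaccard_pairs(stream)
```

This baseline keeps one counter per co-occurring pair, so its memory can grow quadratically in the number of distinct items. The summary's claim is that, under random arrival order, comparable answers can be obtained in much smaller space.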
Abstract
While there has been a lot of work on finding frequent itemsets in transaction data streams, none of these solve the problem of finding similar pairs according to standard similarity measures. This paper is a first attempt at dealing with this, arguably more important, problem. We start out with a negative result that also explains the lack of theoretical upper bounds on the space usage of data mining algorithms for finding frequent itemsets: Any algorithm that (even only approximately and with a chance of error) finds the most frequent k-itemset must use space Omega(min{mb,n^k,(mb/phi)^k}) bits, where mb is the number of items in the stream so far, n is the number of distinct items and phi is a support threshold. To achieve any non-trivial space upper bound we must thus abandon a worst-case assumption on the data stream. We work under the model that the transactions come in random…
Taxonomy
Topics: Data Mining Algorithms and Applications · Rough Sets and Fuzzy Logic · Advanced Database Systems and Queries
