The SpaceSaving$\pm$ Family of Algorithms for Data Streams with Bounded Deletions
Fuheng Zhao, Divyakant Agrawal, Amr El Abbadi, Claire Mathieu, Ahmed Metwally, Michel de Rougemont

TL;DR
This paper introduces a family of SpaceSaving$\pm$ algorithms for efficient frequency estimation in data streams with bounded deletions, offering improved accuracy, space efficiency, and mergeability for distributed processing.
Contribution
The paper defines the SpaceSaving$\pm$ family, proposes three new algorithms with different trade-offs, and proves their correctness, error bounds, and mergeability in the bounded deletion model.
Findings
Algorithms achieve near-optimal space usage.
Errors are independent of hot items in skewed data.
All algorithms satisfy mergeability for distributed applications.
Abstract
In this paper, we present an advanced analysis of near-optimal algorithms that use limited space to solve the frequency estimation, heavy hitters, frequent items, and top-k approximation problems in the bounded deletion model. We define the family of SpaceSaving algorithms and explain why the original SpaceSaving algorithm only works when insertions and deletions are not interleaved. Next, we propose the new Double SpaceSaving, Unbiased Double SpaceSaving, and Integrated SpaceSaving algorithms and prove their correctness. The three proposed algorithms represent different trade-offs: Double SpaceSaving can be extended to provide unbiased estimations, while Integrated SpaceSaving uses less space. Since data streams are often skewed, we present an improved analysis of these algorithms and show that errors do not depend on the hot items. We also demonstrate how to…
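For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal sketch of the classic insertion-only SpaceSaving summary. It is not one of the three algorithms proposed in the paper and it does not handle deletions. The class name `SpaceSaving`, the parameter `k` (number of counters), and the method names are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, Hashable


class SpaceSaving:
    """Insertion-only SpaceSaving summary (illustrative sketch, hypothetical names)."""

    def __init__(self, k: int) -> None:
        if k <= 0:
            raise ValueError("k must be positive")
        self.k = k  # maximum number of monitored items
        self.counts: Dict[Hashable, int] = {}

    def insert(self, item: Hashable) -> None:
        if item in self.counts:
            # Monitored item: increment its counter.
            self.counts[item] += 1
        elif len(self.counts) < self.k:
            # Free counter available: start monitoring the item.
            self.counts[item] = 1
        else:
            # Summary is full: evict a minimum-count item and let the
            # newcomer inherit that count plus one.
            victim = min(self.counts, key=self.counts.get)
            min_count = self.counts.pop(victim)
            self.counts[item] = min_count + 1

    def estimate(self, item: Hashable) -> int:
        # Monitored items are never underestimated; unmonitored items report 0.
        return self.counts.get(item, 0)


summary = SpaceSaving(k=3)
for x in ["a"] * 5 + ["b"] * 3 + ["c", "d"]:
    summary.insert(x)
print(summary.counts)
```

In this sketch the counters always sum to the number of insertions, so with `k` counters the minimum counter, and hence the overestimate of any monitored item, is at most the stream length divided by `k`. A linear scan for the minimum keeps the code short; a practical implementation would use a structure with faster minimum lookup.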
Taxonomy
Topics: Distributed systems and fault tolerance · Advanced Data Storage Technologies · Advanced Database Systems and Queries
