Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme

Dimitris Bertsimas; Vassilis Digalakis Jr

arXiv:2007.09261·cs.DS·July 19, 2022

TL;DR

This paper introduces a machine learning-based method for frequency estimation in data streams that optimizes hashing schemes using observed data, significantly improving accuracy over existing algorithms.

Contribution

It proposes a novel optimization and machine learning framework for hashing in data streams, enabling near-optimal frequency distribution compression and improved estimation accuracy.

Findings

01. Outperforms existing methods by one to two orders of magnitude in estimation error

02. Achieves a 45-90% reduction in expected estimation error

03. Develops an efficient algorithm with linear-time exact solutions in certain cases

Abstract

We present a novel approach for the problem of frequency estimation in data streams that is based on optimization and machine learning. Contrary to state-of-the-art streaming frequency estimation algorithms, which heavily rely on random hashing to maintain the frequency distribution of the data stream using limited storage, the proposed approach exploits an observed stream prefix to near-optimally hash elements and compress the target frequency distribution. We develop an exact mixed-integer linear optimization formulation, which enables us to compute optimal or near-optimal hashing schemes for elements seen in the observed stream prefix; then, we use machine learning to hash unseen elements. Further, we develop an efficient block coordinate descent algorithm, which, as we empirically show, produces high quality solutions, and, in a special case, we are able to solve the proposed…
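To make the contrast between random hashing and prefix-informed hashing concrete, here is a minimal Python sketch. It is not the authors' mixed-integer formulation or their block coordinate descent algorithm. It assumes, purely for illustration, that each bucket stores one aggregate count and that an element's frequency is estimated as the average count of its bucket. The helper names (`random_assignment`, `sorted_assignment`, `total_abs_error`) and the toy prefix are hypothetical, and the sort-and-split rule is a simple stand-in heuristic for a learned scheme.

```python
import random
from collections import Counter


def random_assignment(elements, num_buckets, seed=0):
    """Baseline: assign each element to a bucket uniformly at random."""
    rng = random.Random(seed)
    return {e: rng.randrange(num_buckets) for e in elements}


def sorted_assignment(freqs, num_buckets):
    """Hypothetical stand-in for a learned scheme (not the paper's method):
    sort elements by prefix frequency and split them into contiguous groups,
    so that elements with similar frequencies share a bucket."""
    ordered = sorted(freqs, key=freqs.get)
    size = -(-len(ordered) // num_buckets)  # ceiling division
    return {e: i // size for i, e in enumerate(ordered)}


def total_abs_error(freqs, assignment):
    """Sum over elements of |true frequency - bucket average| (assumed estimator)."""
    totals, counts = Counter(), Counter()
    for e, b in assignment.items():
        totals[b] += freqs[e]
        counts[b] += 1
    return sum(abs(freqs[e] - totals[b] / counts[b]) for e, b in assignment.items())


# Toy observed prefix: three frequent and three rare elements.
prefix = ["a"] * 50 + ["b"] * 48 + ["c"] * 47 + ["d"] * 3 + ["e"] * 2 + ["f"] * 1
freqs = Counter(prefix)

err_random = total_abs_error(freqs, random_assignment(freqs, 2))
err_sorted = total_abs_error(freqs, sorted_assignment(freqs, 2))
print("random hashing error:", err_random)
print("frequency-aware error:", err_sorted)
```

On this toy prefix, grouping elements of similar frequency keeps every bucket average close to its members' true counts, which is the intuition behind using the observed prefix instead of a random hash.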
