Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme
Dimitris Bertsimas, Vassilis Digalakis Jr

TL;DR
This paper introduces a machine learning-based method for frequency estimation in data streams that optimizes hashing schemes using observed data, significantly improving accuracy over existing algorithms.
Contribution
It proposes a novel optimization and machine learning framework for hashing in data streams, enabling near-optimal frequency distribution compression and improved estimation accuracy.
Findings
Outperforms existing methods by 1-2 orders of magnitude in estimation error
Achieves 45-90% reduction in expected estimation error
Develops an efficient algorithm with linear-time exact solutions in certain cases
Abstract
We present a novel approach for the problem of frequency estimation in data streams that is based on optimization and machine learning. Contrary to state-of-the-art streaming frequency estimation algorithms, which rely heavily on random hashing to maintain the frequency distribution of the data stream using limited storage, the proposed approach exploits an observed stream prefix to near-optimally hash elements and compress the target frequency distribution. We develop an exact mixed-integer linear optimization formulation, which enables us to compute optimal or near-optimal hashing schemes for elements seen in the observed stream prefix; then, we use machine learning to hash unseen elements. Further, we develop an efficient block coordinate descent algorithm, which, as we empirically show, produces high quality solutions, and, in a special case, we are able to solve the proposed…
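To make the contrast between random hashing and prefix-informed hashing concrete, here is a minimal sketch. It is not the paper's mixed-integer formulation or its block coordinate descent algorithm. It assumes, purely for illustration, that each element's frequency is estimated as the mean frequency of the bucket it is hashed to, and it swaps in a simple sort-and-chunk heuristic in place of the optimization step. All function names and the toy prefix are hypothetical.

```python
import random
from collections import Counter


def bucket_estimates(freqs, assignment, num_buckets):
    """Estimate each element's frequency as the mean frequency of its bucket.

    Assumption for this sketch: one aggregate count per bucket is stored,
    and the estimate is that count divided by the bucket's element count.
    """
    totals = [0] * num_buckets
    sizes = [0] * num_buckets
    for elem, f in freqs.items():
        b = assignment[elem]
        totals[b] += f
        sizes[b] += 1
    return {e: totals[assignment[e]] / sizes[assignment[e]] for e in freqs}


def total_abs_error(freqs, estimates):
    """Sum of absolute estimation errors over all observed elements."""
    return sum(abs(freqs[e] - estimates[e]) for e in freqs)


def random_assignment(freqs, num_buckets, seed=0):
    """Baseline: assign every element to a uniformly random bucket."""
    rng = random.Random(seed)
    return {e: rng.randrange(num_buckets) for e in freqs}


def sorted_assignment(freqs, num_buckets):
    """Illustrative heuristic (not the paper's method): sort elements by
    prefix frequency and cut the order into equal-size contiguous chunks,
    so elements with similar frequencies share a bucket."""
    ordered = sorted(freqs, key=freqs.get)
    n = len(ordered)
    return {e: min(i * num_buckets // n, num_buckets - 1)
            for i, e in enumerate(ordered)}


# Toy observed stream prefix with two heavy, two medium, and two rare elements.
prefix = ["a"] * 50 + ["b"] * 48 + ["c"] * 5 + ["d"] * 4 + ["e"] + ["f"]
freqs = Counter(prefix)
num_buckets = 3

random_err = total_abs_error(
    freqs, bucket_estimates(freqs, random_assignment(freqs, num_buckets), num_buckets)
)
sorted_err = total_abs_error(
    freqs, bucket_estimates(freqs, sorted_assignment(freqs, num_buckets), num_buckets)
)
print(f"random hashing error: {random_err:.2f}")
print(f"prefix-informed error: {sorted_err:.2f}")
```

On this toy prefix the heuristic pairs elements of similar frequency, so each bucket mean stays close to its members' true counts. Handling elements that never appear in the prefix, which the abstract assigns to a machine learning model, is left out of this sketch.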
