Real-Time Device Reach Forecasting Using HLL and MinHash Data Sketches
Chandrashekar Muniyappa, Kendall Willets, Sriraman Krishnamoorthy

TL;DR
This paper introduces a real-time device reach forecasting system using optimized MinHash and HyperLogLog data sketches, enabling fast, accurate predictions for advertising without extensive offline processing.
Contribution
The work presents a novel real-time prediction system that improves MinHash algorithms with SIMD vectorization and addresses multilevel aggregation challenges, achieving high speed and accuracy.
Findings
Achieves real-time device reach predictions within a 5% error rate.
Runs 4 times faster than traditional MinHash implementations.
Maintains accuracy comparable to offline systems.
Abstract
Predicting the right number of TVs (device reach) in real time from user-specified targeting attributes is imperative for running a multi-million-dollar ads business. The traditional approach, SQL queries that join billions of records across multiple targeting dimensions, is extremely slow. As a workaround, many applications run an offline process to crunch these numbers and present the results many hours later. In our case, the offline process took 24 hours to onboard a customer, resulting in a potential loss of business. To solve this problem, we built a new real-time prediction system that uses MinHash and HyperLogLog (HLL) data sketches to compute device reach at runtime when a user makes a request. However, existing MinHash implementations do not solve the complex problem of multilevel aggregation and intersection. This work will show how we have…
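To make the sketch-based idea in the abstract concrete, the following is a minimal textbook MinHash illustration in Python, not the authors' optimized SIMD implementation. It shows how per-attribute signatures can be merged (aggregation) and compared (intersection) without joining raw device records. The names `NUM_HASHES`, `minhash_signature`, `merge`, `jaccard_estimate`, and `intersection_estimate` are hypothetical, and the union cardinality, which a production system would typically obtain from an HLL sketch, is passed in as a plain number here to keep the example short.

```python
import hashlib

NUM_HASHES = 128  # hypothetical signature length; not a value from the paper


def _hash(item: str, seed: int) -> int:
    # One 64-bit hash per (item, seed) pair; the seed is used as a BLAKE2b salt.
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=NUM_HASHES):
    # items must be a non-empty, re-iterable collection (e.g. a set of device IDs).
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def merge(sig_a, sig_b):
    # Element-wise minimum yields the signature of the union of the two sets,
    # which is what lets lower-level sketches roll up into higher-level ones.
    return [min(a, b) for a, b in zip(sig_a, sig_b)]


def jaccard_estimate(sig_a, sig_b):
    # The fraction of matching positions estimates |A ∩ B| / |A ∪ B|.
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def intersection_estimate(sig_a, sig_b, union_cardinality):
    # |A ∩ B| ≈ J(A, B) * |A ∪ B|. The union size is an input here; an HLL
    # estimate would be the usual source (assumption, not shown).
    return jaccard_estimate(sig_a, sig_b) * union_cardinality


# Usage: two hypothetical targeting segments with overlapping devices.
sports = {f"tv-{i}" for i in range(0, 600)}
news = {f"tv-{i}" for i in range(300, 900)}
sig_sports = minhash_signature(sports)
sig_news = minhash_signature(news)
reach = intersection_estimate(sig_sports, sig_news, len(sports | news))
print(f"estimated devices in both segments: {reach:.0f} (exact: {len(sports & news)})")
```

The estimate's error shrinks roughly with the square root of the signature length, so a larger `NUM_HASHES` trades memory and hashing time for accuracy.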
