# Sampling Sketches for Concave Sublinear Functions of Frequencies

**Authors:** Edith Cohen, Ofir Geri

arXiv: 1907.02218 · 2019-12-24

## TL;DR

This paper introduces composable sampling sketches for massive distributed datasets, used to estimate statistics in which each key's contribution is weighted by a concave sublinear function of its frequency. The sketches come with statistical guarantees and an experimental evaluation.

## Contribution

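We design composable sampling sketches that can be tailored to any concave sublinear function of the frequencies. The sketch structure size is very close to the desired sample size, and the estimation guarantees are very close to those of an ideal sample of the same size computed over aggregated data.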

## Key findings

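- The sketch structure size is very close to the desired sample size
- The samples provide statistical guarantees on estimation quality that are very close to those of an ideal sample of the same size computed over aggregated data
- Experiments demonstrate the simplicity and effectiveness of the methods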

## Abstract

We consider massive distributed datasets that consist of elements modeled as key-value pairs and the task of computing statistics or aggregates where the contribution of each key is weighted by a function of its frequency (sum of values of its elements). This fundamental problem has a wealth of applications in data analytics and machine learning, in particular, with concave sublinear functions of the frequencies that mitigate the disproportionate effect of keys with high frequency. The family of concave sublinear functions includes low frequency moments ($p \leq 1$), capping, logarithms, and their compositions. A common approach is to sample keys, ideally, proportionally to their contributions and estimate statistics from the sample. A simple but costly way to do this is by aggregating the data to produce a table of keys and their frequencies, applying our function to the frequency values, and then applying a weighted sampling scheme. Our main contribution is the design of composable sampling sketches that can be tailored to any concave sublinear function of the frequencies. Our sketch structure size is very close to the desired sample size and our samples provide statistical guarantees on the estimation quality that are very close to those of an ideal sample of the same size computed over aggregated data. Finally, we demonstrate experimentally the simplicity and effectiveness of our methods.
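To make the "simple but costly" baseline from the abstract concrete, the following is a minimal Python sketch of aggregate-then-sample. It is not the paper's composable sketch: it materializes the full frequency table, which is exactly the cost the paper aims to avoid. All function names (`cap`, `moment`, `log1p_freq`, `aggregate`, `ppswor_sample`, `estimate_sum`) are hypothetical, and the choice of weighted sampling scheme (sampling without replacement with probability proportional to weight, implemented with exponentially distributed seeds) is an assumption made for illustration, since the abstract does not name a specific scheme.

```python
import math
import random
from collections import defaultdict

# Examples of concave sublinear functions named in the abstract.
def cap(T):
    return lambda w: min(w, T)          # capping at threshold T

def moment(p):
    return lambda w: w ** p             # low frequency moment, p <= 1

def log1p_freq(w):
    return math.log(1.0 + w)            # logarithm

def aggregate(elements):
    """Costly step: build the full table key -> frequency (sum of values)."""
    freq = defaultdict(float)
    for key, value in elements:
        freq[key] += value
    return dict(freq)

def ppswor_sample(weights, k, rng):
    """Assumed scheme: keep the k keys with the smallest Exp(weight) seeds.

    Returns the sampled keys with their weights and the threshold tau,
    the (k+1)-th smallest seed (infinity if at most k keys exist).
    """
    seeds = {key: rng.expovariate(w) for key, w in weights.items() if w > 0}
    ranked = sorted(seeds, key=seeds.get)
    tau = seeds[ranked[k]] if len(ranked) > k else math.inf
    return {key: weights[key] for key in ranked[:k]}, tau

def estimate_sum(sample, tau, predicate=lambda key: True):
    """Inverse-probability estimate of the total weight of keys matching predicate."""
    total = 0.0
    for key, w in sample.items():
        if predicate(key):
            # Probability that an Exp(w) seed falls below tau.
            p = -math.expm1(-w * tau)
            total += w / p
    return total

# Usage: weight each key by its frequency capped at 3, then sample.
elements = [("a", 1), ("b", 2), ("a", 3), ("c", 5)]
freq = aggregate(elements)                       # a: 4, b: 2, c: 5
f = cap(3)
weights = {key: f(w) for key, w in freq.items()}  # a: 3, b: 2, c: 3
sample, tau = ppswor_sample(weights, k=2, rng=random.Random(0))
estimate = estimate_sum(sample, tau)             # estimates 3 + 2 + 3 = 8
```

When `k` is at least the number of keys, every key is kept, `tau` is infinite, and the estimate equals the exact sum. For smaller `k` the estimate is random and depends on the seeds drawn.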

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.02218/full.md

## References

39 references — full list in the complete paper: https://tomesphere.com/paper/1907.02218/full.md

---
Source: https://tomesphere.com/paper/1907.02218