Distance Queries from Sampled Data: Accurate and Efficient
Edith Cohen

TL;DR
This paper introduces novel estimators, Pareto optimal in variance, for estimating $L_p$ distances from sampled data, enabling scalable and accurate analysis of large data sets from small samples.
Contribution
It develops the first effective estimators for $L_p$ distances that apply to the most common weighted sampling schemes, improving accuracy and scalability in data analysis.
Findings
Estimators are Pareto optimal in variance.
Accurate distance estimation with small samples.
Scalable performance demonstrated on diverse datasets.
Abstract
Distance queries are a basic tool in data analysis. They are used for detection and localization of change for the purpose of anomaly detection, monitoring, or planning. Distance queries are particularly useful when data sets such as measurements, snapshots of a system, content, traffic matrices, and activity logs are collected repeatedly. Random sampling, which can be efficiently performed over streamed or distributed data, is an important tool for scalable data analysis. The sample constitutes an extremely flexible summary, which naturally supports domain queries and scalable estimation of statistics, which can be specified after the sample is generated. The effectiveness of a sample as a summary, however, hinges on the estimators we have. We derive novel estimators for estimating distance from sampled data. Our estimators apply with the most common weighted sampling…
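To make the idea of estimating a distance from samples concrete, here is a minimal sketch. It is not the paper's estimator: it assumes a simple shared-seed (coordinated) PPS scheme in which an entry v of a key is kept when v >= u * tau, where u is a seed shared by both data sets and tau is a sampling threshold, and it applies a standard inverse-probability (Horvitz-Thompson style) estimate to the larger and smaller entry separately. The function names and parameters (`sample_coordinated`, `estimate_abs_diff`, `estimate_l1`, `tau`, `seeds`) are hypothetical and chosen only for illustration.

```python
def sample_coordinated(v1, v2, u, tau):
    """Shared-seed PPS sampling of one key (assumed scheme, not from the paper).

    An entry v is kept iff v > 0 and v >= u * tau, with the same seed u
    (in (0, 1]) used for both data sets. Unsampled entries are None.
    """
    thr = u * tau
    s1 = v1 if v1 > 0 and v1 >= thr else None
    s2 = v2 if v2 > 0 and v2 >= thr else None
    return s1, s2


def estimate_abs_diff(s1, s2, tau):
    """Unbiased estimate of |v1 - v2| for one key from its sampled entries.

    With a shared seed, a lone sampled entry must be the larger one, so
    max(v1, v2) is sampled with probability min(1, max/tau) and both
    entries are sampled with probability min(1, min/tau). Dividing a value
    v by min(1, v/tau) gives max(v, tau), which is used directly here.
    """
    seen = [s for s in (s1, s2) if s is not None]
    if not seen:
        return 0.0
    est = float(max(max(seen), tau))       # inverse-probability estimate of the max
    if len(seen) == 2:
        est -= float(max(min(seen), tau))  # minus the estimate of the min
    return est


def estimate_l1(a, b, seeds, tau):
    """Sum per-key estimates over all keys; a and b map key -> nonnegative value."""
    total = 0.0
    for key in set(a) | set(b):
        s1, s2 = sample_coordinated(a.get(key, 0), b.get(key, 0), seeds[key], tau)
        total += estimate_abs_diff(s1, s2, tau)
    return total
```

For a key with values 5 and 2 and tau = 10, the estimate is 10 when only the larger entry is sampled (0.2 < u <= 0.5) and 0 otherwise, so its expectation over u is 3, the true absolute difference. The per-key variance is large, which is why the quality of the estimator matters as much as the sampling scheme.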
Taxonomy
Topics: Data Management and Algorithms · Algorithms and Data Compression · Machine Learning and Algorithms
