Distance Queries from Sampled Data: Accurate and Efficient

Edith Cohen

arXiv:1203.4903 · cs.DS · March 20, 2015 · 1 citation

TL;DR

This paper derives novel estimators for $L_p$ distances from sampled data. The estimators are Pareto optimal in variance and remain accurate with small samples, which supports scalable analysis of large datasets.

Contribution

It develops the first effective estimators for $L_p$ distances applicable to common sampling schemes, improving accuracy and scalability in data analysis.

Findings

01. Estimators are Pareto optimal in variance.

02. Accurate distance estimation with small samples.

03. Scalable performance demonstrated on diverse datasets.

Abstract

Distance queries are a basic tool in data analysis. They are used for detection and localization of change for the purpose of anomaly detection, monitoring, or planning. Distance queries are particularly useful when data sets such as measurements, snapshots of a system, content, traffic matrices, and activity logs are collected repeatedly. Random sampling, which can be efficiently performed over streamed or distributed data, is an important tool for scalable data analysis. The sample constitutes an extremely flexible summary, which naturally supports domain queries and scalable estimation of statistics, which can be specified after the sample is generated. The effectiveness of a sample as a summary, however, hinges on the estimators we have. We derive novel estimators for estimating $L_{p}$ distance from sampled data. Our estimators apply with the most common weighted sampling…
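
The abstract's point that a sample can answer domain queries specified after sampling can be illustrated with a minimal sketch. This is not the paper's $L_p$ distance estimator. It shows only the standard building block, assuming independent (Poisson) sampling with probability proportional to size and a Horvitz-Thompson estimate of a subset sum. The function names, the threshold `tau`, and the toy `traffic` data are all hypothetical.

```python
import random

def pps_poisson_sample(values, tau, rng):
    """Keep each key independently with probability min(1, value / tau).

    Assumption: values are non-negative; a key with value 0 is never sampled.
    Each sampled key stores its value and its inclusion probability.
    """
    sample = {}
    for key, value in values.items():
        p = min(1.0, value / tau)
        if rng.random() < p:
            sample[key] = (value, p)
    return sample

def estimate_domain_sum(sample, in_domain):
    """Horvitz-Thompson estimate of the sum of values over keys selected by in_domain.

    Each sampled value is divided by its inclusion probability, so the
    estimate is unbiased for any domain predicate chosen after sampling.
    """
    return sum(value / p for key, (value, p) in sample.items() if in_domain(key))

# Hypothetical per-key traffic volumes.
traffic = {"a": 120.0, "b": 3.0, "c": 40.0, "d": 0.5, "e": 75.0}

rng = random.Random(0)
sample = pps_poisson_sample(traffic, tau=50.0, rng=rng)

# The domain is specified only after the sample has been drawn.
estimate_bc = estimate_domain_sum(sample, lambda key: key in {"b", "c"})
estimate_all = estimate_domain_sum(sample, lambda key: True)
```

Under this scheme a sampled key contributes `max(value, tau)` to the estimate, so heavy keys are kept exactly and light keys are represented by occasional contributions of size `tau`. Estimating a distance between two sampled instances is harder because a key may be sampled in one instance and missing from the other, which is the setting the paper addresses.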

Taxonomy

Topics: Data Management and Algorithms · Algorithms and Data Compression · Machine Learning and Algorithms