Cost-efficient Data Acquisition on Online Data Marketplaces for Correlation Analysis
Yanying Li, Haipei Sun, Boxiang Dong, Hui (Wendy) Wang

TL;DR
This paper introduces DANCE, a middleware system that efficiently acquires high-quality, correlated data instances from online data marketplaces for analysis, balancing data quality, budget constraints, and informativeness.
Contribution
It proposes a novel two-phase framework with a join graph construction and a heuristic search algorithm for cost-effective data acquisition in dirty, paid data marketplaces.
Findings
The heuristic algorithm is efficient and effective on benchmark datasets.
DANCE successfully balances data quality, cost, and correlation objectives.
The problem is NP-hard, justifying the heuristic approach.
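To make the budget-versus-correlation trade-off concrete, the sketch below runs a brute-force search over a toy marketplace. It is an illustration only and not the DANCE algorithm: the `MARKET` data, the prices, the join key, and the function names `pearson`, `join_samples`, and `best_purchase` are all hypothetical. Exhaustive enumeration works here only because the toy has three datasets; the NP-hardness result is the reason the paper uses a heuristic instead.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy marketplace: each dataset has a price and sampled rows
# keyed by a shared join key. None of these values come from the paper.
MARKET = {
    "weather": {"price": 4, "rows": {1: {"temp": 10}, 2: {"temp": 20},
                                     3: {"temp": 30}, 4: {"temp": 40}}},
    "sales_a": {"price": 5, "rows": {1: {"sales": 12}, 2: {"sales": 19},
                                     3: {"sales": 33}, 4: {"sales": 41}}},
    "sales_b": {"price": 2, "rows": {1: {"sales": 30}, 2: {"sales": 10},
                                     3: {"sales": 25}, 4: {"sales": 15}}},
}


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def join_samples(names):
    """Inner-join the sampled rows of the named datasets on the shared key."""
    keys = set.intersection(*(set(MARKET[n]["rows"]) for n in names))
    joined = []
    for k in sorted(keys):
        row = {}
        for n in names:
            row.update(MARKET[n]["rows"][k])
        joined.append(row)
    return joined


def best_purchase(x, y, budget):
    """Try every pair of datasets and return (score, cost, names) for the
    affordable pair whose join gives the strongest |correlation| between
    attributes x and y, or None if no pair qualifies."""
    best = None
    for names in combinations(MARKET, 2):
        cost = sum(MARKET[n]["price"] for n in names)
        if cost > budget:
            continue  # over budget
        rows = join_samples(names)
        if len(rows) < 2 or not all(x in r and y in r for r in rows):
            continue  # join is too small or lacks a requested attribute
        score = abs(pearson([r[x] for r in rows], [r[y] for r in rows]))
        if best is None or score > best[0]:
            best = (score, cost, names)
    return best
```

With a budget of 9 the search picks `("weather", "sales_a")`, the strongly correlated but more expensive pair. With a budget of 6 it falls back to `("weather", "sales_b")`, which is cheaper but only weakly correlated. This toy ignores data quality and join informativeness, both of which the paper also optimizes.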
Abstract
Incentivized by enormous economic profits, data marketplace platforms have proliferated recently. In this paper, we consider the data marketplace setting where a data shopper would like to buy data instances from the data marketplace for correlation analysis of certain attributes. We assume that the data in the marketplace is dirty and not free. The goal is to find the data instances from a large number of datasets in the marketplace whose join result is not only of high quality and rich in join informativeness, but also delivers the best correlation between the requested attributes. To achieve this goal, we design DANCE, a middleware that provides the desired data acquisition service. DANCE consists of two phases: (1) In the off-line phase, it constructs a two-layer join graph from samples. The join graph consists of the information of the datasets in the marketplace at both…
Taxonomy
Topics: Data Quality and Management · Data Management and Algorithms · Data Mining Algorithms and Applications
