A tool framework for tweaking features in synthetic datasets

J.W. Zhang; Y.C. Tay

arXiv:1801.03645·cs.DB·January 12, 2018

Open Access

TL;DR

This paper introduces ASPECT, a flexible framework for scaling and customizing synthetic datasets by tweaking features to match desired properties, improving over traditional fixed-feature approaches.

Contribution

The paper presents ASPECT, a novel framework that allows flexible feature-based tweaking of scaled datasets, overcoming the intractability that fixed-feature methods run into as the feature set grows.

Findings

01. ASPECT effectively enforces dataset similarity.

02. ASPECT is efficient in real dataset experiments.

03. Flexible feature customization is achieved with ASPECT.

Abstract

Researchers and developers use benchmarks to compare their algorithms and products. A database benchmark must have a dataset D. To be application-specific, this dataset D should be empirical. However, D may be too small, or too large, for the benchmarking experiments. D must, therefore, be scaled to the desired size. To ensure the scaled D' is similar to D, previous work typically specifies or extracts a fixed set of features F = {F_1, F_2, ..., F_n} from D, then uses F to generate synthetic data for D'. However, this approach (D -> F -> D') becomes increasingly intractable as F gets larger, so a new solution is necessary. Departing from existing approaches, this paper proposes ASPECT to scale D and enforce similarity. ASPECT first uses a size-scaler (S0) to scale D to D'. Then the user selects a set of desired features F'_1, ..., F'_n. For each desired feature F'_k, there is a…
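The two stages named in the abstract (size-scale D to D', then adjust D' one desired feature at a time) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: `size_scale` stands in for S0 using plain sampling with replacement, and `tweak_category_share` is a hypothetical per-feature step that forces the value proportions of one column in D' to match those in D. The text above does not say how ASPECT enforces a feature, so every function name and the tweaking strategy here are invented for illustration.

```python
import random
from collections import Counter
from typing import Callable, Dict, List

Row = Dict[str, object]
Dataset = List[Row]
# A tweak takes the empirical dataset D and the scaled D', and returns an adjusted D'.
Tweak = Callable[[Dataset, Dataset], Dataset]


def size_scale(d: Dataset, target_size: int, seed: int = 0) -> Dataset:
    """Assumed stand-in for the size-scaler S0: sample rows with replacement."""
    rng = random.Random(seed)
    return [dict(rng.choice(d)) for _ in range(target_size)]


def tweak_category_share(column: str) -> Tweak:
    """Hypothetical tweak: make value proportions in `column` match those in D."""
    def tweak(d: Dataset, d_scaled: Dataset) -> Dataset:
        n = len(d_scaled)
        counts = Counter(row[column] for row in d)
        # Ideal (possibly fractional) number of rows per value in the scaled data.
        quotas = {v: c * n / len(d) for v, c in counts.items()}
        # Round down, then hand leftover rows to the largest remainders.
        target = {v: int(q) for v, q in quotas.items()}
        leftover = n - sum(target.values())
        by_remainder = sorted(quotas, key=lambda v: quotas[v] - target[v], reverse=True)
        for v in by_remainder[:leftover]:
            target[v] += 1
        # Overwrite the column so each value appears exactly target[v] times.
        values = [v for v, k in target.items() for _ in range(k)]
        return [{**row, column: v} for row, v in zip(d_scaled, values)]
    return tweak


def scale_and_tweak(d: Dataset, target_size: int, tweaks: List[Tweak]) -> Dataset:
    """Scale D to the target size, then apply each selected tweak in order."""
    d_scaled = size_scale(d, target_size)
    for tweak in tweaks:
        d_scaled = tweak(d, d_scaled)
    return d_scaled
```

For example, `scale_and_tweak(d, 8, [tweak_category_share("kind")])` returns eight rows whose `kind` proportions match those of `d`. The sketch applies tweaks strictly in sequence and makes no attempt to check whether a later tweak disturbs a feature enforced by an earlier one; overwriting a column also ignores its correlation with other columns.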

Taxonomy

Topics: Advanced Database Systems and Queries · Data Management and Algorithms · Web Data Mining and Analysis