Generating Skyline Datasets for Data Science Models
Mengying Wang, Hanchao Ma, Yiyang Bian, Yangxin Fan, Yinghui Wu

TL;DR
This paper presents MODis, a framework that discovers skyline datasets by optimizing multiple model-performance measures at once, avoiding the bias that integrating data toward a single quality measure can introduce.
Contribution
The paper introduces MODis, a multi-goal dataset discovery framework formulated as a finite state transducer, and derives three algorithms for generating skyline datasets.
Findings
Algorithms effectively generate diverse skyline datasets
MODis improves model performance across multiple measures
Experimental results show efficiency and applicability in data pipelines
Abstract
Preparing high-quality datasets required by various data-driven AI and machine learning models has become a cornerstone task in data-driven analysis. Conventional data discovery methods typically integrate datasets towards a single pre-defined quality measure, which may lead to bias for downstream tasks. This paper introduces MODis, a framework that discovers datasets by optimizing multiple user-defined, model-performance measures. Given a set of data sources and a model, MODis selects and integrates data sources into a skyline dataset, over which the model is expected to have the desired performance in all the performance measures. We formulate MODis as a multi-goal finite state transducer, and derive three feasible algorithms to generate skyline datasets. Our first algorithm adopts a "reduce-from-universal" strategy that starts with a universal schema and iteratively prunes unpromising…
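To make the notion of a skyline concrete, the following is a minimal sketch of skyline (Pareto-optimal) selection over candidate datasets. It is not the MODis algorithm itself: it assumes the standard Pareto-dominance definition, assumes every measure is "higher is better", and uses hypothetical dataset names and made-up scores purely for illustration.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]


def dominates(a: Scores, b: Scores) -> bool:
    """Return True if `a` is at least as good as `b` on every measure
    and strictly better on at least one (assumption: higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return sorted(
        name
        for name, score in candidates.items()
        if not any(
            dominates(other, score)
            for other_name, other in candidates.items()
            if other_name != name
        )
    )


# Hypothetical candidates: (accuracy, fairness score) of one model on each dataset.
candidates = {
    "D1": (0.90, 0.60),
    "D2": (0.80, 0.80),
    "D3": (0.70, 0.70),  # dominated by D2 on both measures
    "D4": (0.85, 0.60),  # dominated by D1 (lower accuracy, equal fairness)
}

print(skyline(candidates))
```

Here D1 and D2 remain because each is best on a different measure, which is the trade-off a single pre-defined quality measure would hide.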
Taxonomy
Topics: Data Management and Algorithms · Geographic Information Systems Studies · Human Mobility and Location-Based Analysis
Methods: Sparse Evolutionary Training
