Hierarchical Dataset Selection for High-Quality Data Sharing
Xiaona Zhou, Yingyan Zeng, Ran Jin, Ismini Lourentzou

TL;DR
This paper introduces DaSH, a hierarchical dataset selection method that improves data quality and model performance by selecting entire datasets based on their relevance and utility, outperforming existing methods.
Contribution
The paper formalizes the dataset selection task and proposes DaSH, a hierarchical approach that models utility at dataset and group levels for efficient selection.
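To make the two-level idea concrete, here is a minimal sketch of selecting first a group (for example a repository or institution) and then a dataset within it. This summary does not specify how DaSH models or updates utility, so the sketch swaps in plain Thompson sampling with Beta posteriors as a stand-in. The class names, the group and dataset names, and the assumption that rewards lie in [0, 1] are all hypothetical and not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class Beta:
    """Beta(alpha, beta) belief over a utility in [0, 1] (assumed reward model)."""
    alpha: float = 1.0
    beta: float = 1.0

    def sample(self, rng):
        return rng.betavariate(self.alpha, self.beta)

    def mean(self):
        return self.alpha / (self.alpha + self.beta)

    def update(self, reward):
        # reward is assumed to lie in [0, 1]
        self.alpha += reward
        self.beta += 1.0 - reward


class HierarchicalSelector:
    """Illustrative two-level selector: pick a group, then a dataset inside it."""

    def __init__(self, groups, seed=0):
        # groups: dict mapping a group name to a list of dataset names
        self.rng = random.Random(seed)
        self.group_post = {g: Beta() for g in groups}
        self.dataset_post = {
            g: {d: Beta() for d in datasets} for g, datasets in groups.items()
        }

    def select(self):
        # Thompson sampling at the group level, then at the dataset level.
        group = max(
            self.group_post,
            key=lambda g: self.group_post[g].sample(self.rng),
        )
        dataset = max(
            self.dataset_post[group],
            key=lambda d: self.dataset_post[group][d].sample(self.rng),
        )
        return group, dataset

    def update(self, group, dataset, reward):
        # One observation informs both levels of the hierarchy.
        self.group_post[group].update(reward)
        self.dataset_post[group][dataset].update(reward)
```

In this sketch a single evaluation of a dataset also updates the belief about its group, which is one plausible way a hierarchy could reduce the number of exploration steps: poor groups can be deprioritized without evaluating every dataset they contain.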
Findings
DaSH outperforms state-of-the-art baselines by up to 26.2% in accuracy.
DaSH requires fewer exploration steps and is robust in low-resource settings.
The method is effective across multiple public benchmarks.
Abstract
The success of modern machine learning hinges on access to high-quality training data. In many real-world scenarios, such as acquiring data from public repositories or sharing across institutions, data is naturally organized into discrete datasets that vary in relevance, quality, and utility. Selecting which repositories or institutions to search for useful datasets, and which datasets to incorporate into model training, are therefore critical decisions, yet most existing methods select individual samples and treat all data as equally relevant, ignoring differences between datasets and their sources. In this work, we formalize the task of dataset selection: selecting entire datasets from a large, heterogeneous pool to improve downstream performance under resource constraints. We propose Dataset Selection via Hierarchies (DaSH), a dataset selection method that models utility at both…
Taxonomy
Topics: Data Quality and Management · Machine Learning and Data Classification · Advanced Graph Neural Networks
