SEED: Targeted Data Selection by Weighted Independent Set
Yuan Zhang, Lifeng Guo, Junwen Pan, Wenzhao Zheng, Wen Zhou, Kuan Cheng, Kurt Keutzer, Shanghang Zhang

TL;DR
SEED introduces a graph-based data selection method that enhances sample quality and diversity by calibrating influence scores and normalizing local graph scales, leading to improved model performance.
Contribution
The paper proposes a Weighted Independent Set formulation with two key refinements, node value calibration and local scale normalization, for robust and scalable data selection.
Findings
SEED outperforms existing methods on instruction tuning tasks.
The curated Honeybee-Remake-SEED-200K dataset demonstrates SEED's practical utility.
Experiments confirm SEED's effectiveness across multiple domains and model types.
Abstract
Data selection seeks to identify a compact yet informative subset from large-scale training corpora, balancing sample quality against collection diversity. We formulate this problem as a Weighted Independent Set (WIS) on a similarity graph, where nodes represent data samples weighted by influence, and edges connect semantically redundant pairs. This formulation naturally yields subsets that are simultaneously high-quality and diverse. However, two challenges arise in practice: naive node weights fail to distinguish informative signals from gradient noise, and edge construction under heterogeneous domain distributions produces structurally imbalanced graphs that bias selection toward sparse regions. To address these issues, we introduce two principled refinements from a unified graph perspective: (1) node value calibration that restricts influence estimation to the bilateral…
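Illustrative sketch
The basic WIS formulation described in the abstract can be illustrated with a small greedy heuristic. This is a minimal sketch under stated assumptions, not the authors' algorithm: it omits both node value calibration and local scale normalization, it uses a single global cosine-similarity threshold to build edges, and the names `greedy_wis`, `sim_threshold`, and `budget` are hypothetical. Greedy selection by descending weight is a common heuristic for WIS and does not guarantee an optimal set.

```python
import numpy as np

def greedy_wis(weights, embeddings, sim_threshold=0.9, budget=None):
    """Greedy heuristic for a weighted independent set on a similarity graph.

    weights       : per-sample scores (stand-in for influence; hypothetical input)
    embeddings    : one nonzero vector per sample
    sim_threshold : assumed global cosine cutoff above which two samples are
                    treated as redundant (i.e. joined by an edge)
    budget        : optional cap on the number of selected samples
    """
    weights = np.asarray(weights, dtype=float)
    X = np.asarray(embeddings, dtype=float)
    # Normalize rows so the dot product equals cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    adjacent = (X @ X.T) >= sim_threshold
    np.fill_diagonal(adjacent, False)  # no self-loops

    available = np.ones(len(weights), dtype=bool)
    selected = []
    # Visit nodes from highest to lowest weight.
    for i in np.argsort(-weights, kind="stable"):
        if budget is not None and len(selected) >= budget:
            break
        if available[i]:
            selected.append(int(i))
            # Taking node i rules out all of its neighbours.
            available[adjacent[i]] = False
            available[i] = False
    return selected

# Two near-duplicate pairs: (0, 1) and (2, 3).
emb = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
w = [0.9, 0.8, 0.5, 0.7]
picked = greedy_wis(w, emb)
```

In this toy input the sketch keeps one sample from each near-duplicate pair, preferring the higher-weighted one, which is the quality-versus-diversity trade-off the formulation is meant to capture. A dense similarity matrix is quadratic in the number of samples, so a corpus-scale version would need an approximate nearest-neighbour graph instead.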
