SEED: Targeted Data Selection by Weighted Independent Set

Yuan Zhang; Lifeng Guo; Junwen Pan; Wenzhao Zheng; Wen Zhou; Kuan Cheng; Kurt Keutzer; Shanghang Zhang

arXiv:2605.15691 · cs.LG · May 21, 2026

TL;DR

SEED introduces a graph-based data selection method that enhances sample quality and diversity by calibrating influence scores and normalizing local graph scales, leading to improved model performance.

Contribution

The paper proposes a novel Weighted Independent Set formulation with two key refinements—node value calibration and local scale normalization—for robust and scalable data selection.

Findings

01

SEED outperforms existing methods on instruction tuning tasks.

02

The curated Honeybee-Remake-SEED-200K dataset demonstrates SEED's practical utility.

03

Experiments confirm SEED's effectiveness across multiple domains and model types.

Abstract

Data selection seeks to identify a compact yet informative subset from large-scale training corpora, balancing sample quality against collection diversity. We formulate this problem as a Weighted Independent Set (WIS) on a similarity graph, where nodes represent data samples weighted by influence, and edges connect semantically redundant pairs. This formulation naturally yields subsets that are simultaneously high-quality and diverse. However, two challenges arise in practice: naive node weights fail to distinguish informative signals from gradient noise, and edge construction under heterogeneous domain distributions produces structurally imbalanced graphs that bias selection toward sparse regions. To address these issues, we introduce two principled refinements from a unified graph perspective: (1) node value calibration that restricts influence estimation to the bilateral…
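
To make the WIS formulation in the abstract concrete, here is a minimal sketch of selecting a subset as an independent set on a similarity graph. It is an illustration under stated assumptions, not the SEED algorithm: the function name `greedy_wis`, the cosine-similarity edge rule, the `sim_threshold` value, and the toy weights are all hypothetical, and a simple greedy heuristic stands in for whatever solver the paper uses. Neither of the paper's two refinements (node value calibration, local scale normalization) is implemented here.

```python
import numpy as np


def greedy_wis(embeddings, weights, sim_threshold=0.9, budget=None):
    """Greedy heuristic for a weighted independent set (illustrative only).

    Nodes are samples, `weights` are per-sample scores (a stand-in for
    influence), and an edge joins two samples whose cosine similarity is
    at least `sim_threshold` (a stand-in for "semantically redundant").
    """
    # Normalize rows so the dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    similarity = unit @ unit.T

    # Boolean adjacency matrix of the similarity graph, without self-loops.
    adjacent = similarity >= sim_threshold
    np.fill_diagonal(adjacent, False)

    # Visit nodes from highest to lowest weight.
    order = np.argsort(-weights, kind="stable")
    blocked = np.zeros(len(weights), dtype=bool)
    selected = []
    for i in order:
        if blocked[i]:
            continue  # a redundant neighbor was already selected
        selected.append(int(i))
        blocked |= adjacent[i]  # exclude all neighbors of the chosen node
        if budget is not None and len(selected) >= budget:
            break
    return selected


# Toy data: samples 0 and 1 are near-duplicates, as are samples 2 and 3.
embeddings = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
weights = np.array([0.9, 0.8, 0.5, 0.7])
selected = greedy_wis(embeddings, weights)
print(selected)
```

On this toy input the sketch keeps one sample from each near-duplicate pair, preferring the higher-weighted one. Greedy selection does not guarantee a maximum-weight independent set; it is used here only because it is short and easy to follow.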
