The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph

Minghao Wu; Thuy-Trang Vu; Lizhen Qu; Gholamreza Haffari

arXiv:2410.12458·cs.CL·May 28, 2025

TL;DR

This paper introduces GraphFilter, a bipartite graph-based data selection method that effectively balances quality and diversity, improving fine-tuning outcomes for large language models across multiple benchmarks.

Contribution

We propose GraphFilter, a novel set cover approach that models data as a bipartite graph and combines quality and diversity metrics for superior data subset selection.

Findings

1. Outperforms nine baselines in model performance

2. Enhances computational efficiency in data selection

3. Highlights the importance of instruction diversity

Abstract

The performance of large language models (LLMs) is strongly influenced by the quality and diversity of data used during supervised fine-tuning (SFT). However, current data selection methods often prioritize one aspect over the other, resulting in suboptimal training outcomes. To address this, we formulate data selection as a set cover problem and present GraphFilter, a novel approach that balances both quality and diversity in data selection. GraphFilter models the dataset as a bipartite graph connecting sentences to their constituent n-grams, then employs a priority function that combines quality and diversity metrics multiplicatively. GraphFilter iteratively selects sentences with the highest priority, removes covered n-grams from the bipartite graph, and recomputes priorities to reflect the changing data landscape. We validate GraphFilter using three model backbones across six…
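The selection loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenizer, the `max_n` parameter, the use of a raw count of uncovered n-grams as the diversity term, and the early stop when nothing new can be covered are all assumptions made for illustration. The quality scores are taken as given inputs.

```python
from collections import defaultdict


def ngrams(text, max_n=2):
    """Return the set of 1..max_n-grams of a whitespace-tokenized sentence."""
    tokens = text.lower().split()
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def graph_filter_sketch(sentences, quality, budget, max_n=2):
    """Greedy set-cover-style selection over a sentence/n-gram bipartite graph.

    Hypothetical sketch: priority = quality * (number of uncovered n-grams).
    The diversity term is an assumption, not the paper's exact metric.
    """
    # Bipartite graph stored as two adjacency maps.
    sent_to_ngrams = [ngrams(s, max_n) for s in sentences]
    ngram_to_sents = defaultdict(set)
    for idx, grams in enumerate(sent_to_ngrams):
        for g in grams:
            ngram_to_sents[g].add(idx)

    remaining = set(range(len(sentences)))
    selected = []
    while remaining and len(selected) < budget:
        # Multiplicative priority; ties are broken by the lowest index.
        best = max(
            remaining,
            key=lambda i: (quality[i] * len(sent_to_ngrams[i]), -i),
        )
        if not sent_to_ngrams[best]:
            break  # sketch choice: stop once no sentence adds new n-grams
        selected.append(best)
        remaining.discard(best)
        # Remove the covered n-grams from the graph; the priorities of
        # neighbouring sentences shrink accordingly on the next iteration.
        for g in set(sent_to_ngrams[best]):
            for j in ngram_to_sents.pop(g, ()):
                sent_to_ngrams[j].discard(g)
    return selected
```

For example, with sentences `["the cat sat", "the cat sat", "a dog ran"]`, quality scores `[0.9, 0.8, 0.5]`, and a budget of 2, the sketch returns `[0, 2]`: once the first sentence is chosen, its duplicate covers no new n-grams, so the lower-quality but novel third sentence is preferred.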

Taxonomy

Topics: Data Mining Algorithms and Applications · Data Management and Algorithms

Methods: Focus