Selection of LLM Fine-Tuning Data based on Orthogonal Rules

Xiaomin Li; Mingye Gao; Zhiwei Zhang; Chang Yue; Hong Hu

arXiv:2410.04715 · cs.CL · November 12, 2025

TL;DR

This paper introduces a rule-based data selection framework for LLM fine-tuning. It uses an orthogonality metric and a determinantal point process (DPP) to select complementary rating rules, then uses those rules to score the data and pick high-scoring training samples, improving model performance across multiple domains.

Contribution

The paper presents a novel automated pipeline that generates, evaluates, and selects complementary rules for data quality assessment using orthogonality and DPP, enhancing data selection for LLM fine-tuning.

Findings

1. Improved rating accuracy of data quality assessments.
2. Enhanced downstream LLM performance on multiple tasks.
3. Consistent gains across diverse domains.

Abstract

High-quality training data is critical to the performance of large language models (LLMs). Recent work has explored using LLMs to rate and select data based on a small set of human-designed criteria (rules), but these approaches often rely heavily on heuristics, lack principled metrics for rule evaluation, and generalize poorly to new tasks. We propose a novel rule-based data selection framework that introduces a metric based on the orthogonality of rule score vectors to evaluate and select complementary rules. Our automated pipeline first uses LLMs to generate diverse rules covering multiple aspects of data quality, then rates samples according to these rules and applies the determinantal point process (DPP) to select the most independent rules. These rules are then used to score the full dataset, and high-scoring samples are selected for downstream tasks such as LLM fine-tuning. We…
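The rule-selection and sample-selection steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a score matrix with one row per sample and one column per rule, builds a similarity kernel from unit-normalized rule score vectors, and uses a simple greedy determinant maximization as a stand-in for DPP-based selection. The function names `select_rules_greedy` and `select_top_samples`, and the choice to average the selected rules' scores, are hypothetical.

```python
import numpy as np


def select_rules_greedy(scores, k):
    """Pick up to k rules whose score vectors are close to mutually orthogonal.

    scores: array of shape (n_samples, n_rules); column j holds rule j's ratings.
    Greedy log-determinant maximization is an assumed approximation to
    DPP selection, not necessarily the procedure used in the paper.
    """
    # Normalize each rule's score vector so the kernel measures direction only.
    norms = np.linalg.norm(scores, axis=0, keepdims=True)
    unit = scores / np.where(norms == 0, 1.0, norms)
    kernel = unit.T @ unit  # Gram matrix of the rule score vectors

    selected = []
    for _ in range(k):
        best, best_logdet = None, -np.inf
        for j in range(kernel.shape[0]):
            if j in selected:
                continue
            idx = selected + [j]
            # The determinant of the sub-kernel grows as the chosen
            # vectors become more orthogonal to each other.
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = j, logdet
        if best is None:  # every remaining rule is linearly dependent
            break
        selected.append(best)
    return selected


def select_top_samples(scores, rule_idx, m):
    """Return indices of the m samples with the highest mean score
    over the selected rules (averaging is an assumption of this sketch)."""
    quality = scores[:, rule_idx].mean(axis=1)
    return np.argsort(-quality)[:m].tolist()


if __name__ == "__main__":
    # Toy ratings: 3 samples x 4 rules; rule 1 is a rescaled copy of rule 0.
    demo = np.array([[1.0, 2.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0, 3.0]])
    rules = select_rules_greedy(demo, k=3)
    print("selected rules:", rules)
    print("top sample:", select_top_samples(demo, rules, m=1))
```

In the toy data, the redundant rule is never chosen alongside the rule it duplicates, because adding it would make the sub-kernel singular. A real pipeline would obtain the score matrix from LLM ratings on a sample of the dataset before scoring the full dataset with the selected rules.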

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Sparse Evolutionary Training