Selection of LLM Fine-Tuning Data based on Orthogonal Rules
Xiaomin Li, Mingye Gao, Zhiwei Zhang, Chang Yue, Hong Hu

TL;DR
This paper introduces a rule-based data selection framework for LLM fine-tuning. It measures the orthogonality of rule score vectors and applies a determinantal point process (DPP) to pick complementary rating rules, then uses those rules to select high-scoring training samples, improving model performance across multiple domains.
Contribution
The paper presents a novel automated pipeline that generates, evaluates, and selects complementary rules for data quality assessment using orthogonality and DPP, enhancing data selection for LLM fine-tuning.
Findings
Improved rating accuracy of data quality assessments.
Enhanced downstream LLM performance on multiple tasks.
Consistent gains across diverse domains.
Abstract
High-quality training data is critical to the performance of large language models (LLMs). Recent work has explored using LLMs to rate and select data based on a small set of human-designed criteria (rules), but these approaches often rely heavily on heuristics, lack principled metrics for rule evaluation, and generalize poorly to new tasks. We propose a novel rule-based data selection framework that introduces a metric based on the orthogonality of rule score vectors to evaluate and select complementary rules. Our automated pipeline first uses LLMs to generate diverse rules covering multiple aspects of data quality, then rates samples according to these rules and applies the determinantal point process (DPP) to select the most independent rules. These rules are then used to score the full dataset, and high-scoring samples are selected for downstream tasks such as LLM fine-tuning. We…
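The pipeline described in the abstract (rate samples per rule, pick the most independent rules, then keep high-scoring samples) can be sketched as follows. This is a minimal illustration rather than the authors' implementation. It assumes a score matrix with one row per sample and one column per rule, builds the kernel as the Gram matrix of unit-normalized rule score vectors, and swaps in a greedy determinant-maximizing search in place of DPP sampling. The function names `greedy_dpp_select` and `select_top_samples`, the kernel choice, and the mean aggregation of scores are all assumptions made for this example.

```python
import numpy as np


def greedy_dpp_select(scores, r):
    """Greedily pick up to r rule indices whose score vectors are most independent.

    scores: array of shape (n_samples, n_rules), one column per rule.
    Assumption: the kernel is the Gram matrix of unit-normalized columns,
    so its principal minors grow as the chosen columns approach orthogonality.
    """
    scores = np.asarray(scores, dtype=float)
    norms = np.linalg.norm(scores, axis=0, keepdims=True)
    unit = scores / np.where(norms == 0, 1.0, norms)  # guard all-zero columns
    kernel = unit.T @ unit

    selected = []
    remaining = list(range(kernel.shape[0]))
    for _ in range(min(r, len(remaining))):
        best, best_logdet = None, -np.inf
        for j in remaining:
            idx = selected + [j]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            # Keep the candidate that maximizes the determinant of the submatrix.
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = j, logdet
        if best is None:  # every remaining rule is linearly dependent
            break
        selected.append(best)
        remaining.remove(best)
    return selected


def select_top_samples(scores, rule_idx, k):
    """Average the scores of the chosen rules and return the k best sample indices."""
    scores = np.asarray(scores, dtype=float)
    aggregate = scores[:, rule_idx].mean(axis=1)  # assumption: simple mean
    return np.argsort(-aggregate)[:k].tolist()


# Toy data: 4 samples rated by 3 hypothetical rules; rules 0 and 1 are redundant.
toy_scores = np.array([
    [0.9, 0.9, 0.1],
    [0.1, 0.1, 0.8],
    [0.8, 0.8, 0.9],
    [0.2, 0.2, 0.2],
])
chosen_rules = greedy_dpp_select(toy_scores, r=2)
top_samples = select_top_samples(toy_scores, chosen_rules, k=2)
print(chosen_rules, top_samples)
```

In the toy data the first two rules give identical ratings, so the determinant of their joint kernel submatrix is (numerically) zero and the greedy search avoids choosing both. This is the intuition behind using orthogonality to find complementary rules.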
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training
