Automatically Generating Numerous Context-Driven SFT Data for LLMs across Diverse Granularity
Shanghaoran Quan

TL;DR
This paper introduces AugCon, a novel automated method for generating diverse, high-quality context-driven query-response data across multiple granularities to improve supervised fine-tuning of large language models, reducing reliance on costly human annotation.
Contribution
AugCon combines a recursive query generation approach with a contrastive learning scorer and self-improving techniques to produce diverse, high-fidelity training data for LLMs across various contexts.
Findings
AugCon outperforms state-of-the-art methods in diversity and quality of generated data.
Extensive evaluations confirm high fidelity and applicability across multiple benchmarks.
The method is effective for both English and Chinese datasets.
Abstract
Constructing high-quality query-response pairs from a custom corpus is crucial for supervised fine-tuning (SFT) of large language models (LLMs) in many applications, such as creating domain-specific AI assistants or role-playing agents. However, sourcing this data through human annotation is costly, and existing automated methods often fail to capture the diverse range of contextual granularity and tend to produce homogeneous data. To tackle these issues, we introduce a novel method named AugCon, capable of automatically generating context-driven SFT data across multiple levels of granularity with high diversity, quality, and fidelity. AugCon begins by generating queries using the Context-Split-Tree (CST), an innovative approach for recursively deriving queries and splitting context to cover full granularity. Then, we train a scorer through contrastive learning to collaborate with CST to rank and…
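The recursive idea behind the Context-Split-Tree can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that queries are derived and the context is split recursively, so every name here (`toy_splitter`, `context_split_tree`, `min_chars`) is hypothetical, and the sentence-midpoint split and templated query are simple stand-ins for whatever model call performs those steps in practice.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a splitter maps a context to (query, left_part, right_part).
Splitter = Callable[[str], Tuple[str, str, str]]


def toy_splitter(context: str) -> Tuple[str, str, str]:
    """Stand-in for a model call (assumption): templated query, split at the sentence midpoint."""
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    if not sentences:
        return f"What does '{context[:30]}' describe?", "", ""
    mid = len(sentences) // 2
    left = ". ".join(sentences[:mid]) + "." if mid > 0 else ""
    right = ". ".join(sentences[mid:]) + "."
    query = f"What does the passage starting with '{sentences[0][:30]}' describe?"
    return query, left, right


def context_split_tree(
    context: str,
    splitter: Splitter = toy_splitter,
    min_chars: int = 40,  # assumed stopping threshold, not taken from the paper
) -> List[Tuple[str, str]]:
    """Collect (context, query) pairs from the coarsest context down to the finest."""
    context = context.strip()
    if not context:
        return []
    query, left, right = splitter(context)
    pairs = [(context, query)]
    if len(context) < min_chars:
        return pairs  # context is short enough; do not split further
    for part in (left, right):
        # Recurse only on strictly shorter parts so the recursion always terminates.
        if part and len(part) < len(context):
            pairs.extend(context_split_tree(part, splitter, min_chars))
    return pairs


if __name__ == "__main__":
    doc = "Alpha is first. Beta is second. Gamma is third. Delta is fourth."
    for ctx, q in context_split_tree(doc, min_chars=20):
        print(len(ctx), "|", q)
```

Each node of the tree contributes one query tied to its own span of text, so a single document yields queries at document, passage, and sentence granularity. In a full pipeline, a ranking step such as the scorer mentioned in the abstract would then operate on these candidates.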
Taxonomy
Topics: Semantic Web and Ontologies · Scientific Computing and Data Management
Methods: Shrink and Fine-Tune · Contrastive Learning
