Automatically Generating Numerous Context-Driven SFT Data for LLMs across Diverse Granularity

Shanghaoran Quan

arXiv:2405.16579 · cs.CL · May 28, 2024 · 1 citation

TL;DR

This paper introduces AugCon, a novel automated method for generating diverse, high-quality context-driven query-response data across multiple granularities to improve supervised fine-tuning of large language models, reducing reliance on costly human annotation.

Contribution

AugCon combines a recursive query generation approach with a contrastive learning scorer and self-improving techniques to produce diverse, high-fidelity training data for LLMs across various contexts.
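The recursive query generation mentioned above is described in the abstract as a Context-Split-Tree (CST) that recursively derives queries and splits the context. The following is a minimal sketch of that idea, not the authors' implementation. The names Node, toy_deriver, build_tree, collect_queries and the min_words stopping rule are all hypothetical; in particular, toy_deriver is a stand-in for whatever model call would actually produce a query and a split.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interface: context -> (query, left_part, right_part).
# In a real pipeline an LLM call would fill this role.
Deriver = Callable[[str], Tuple[str, str, str]]


@dataclass
class Node:
    context: str
    query: str
    children: List["Node"] = field(default_factory=list)


def toy_deriver(context: str) -> Tuple[str, str, str]:
    """Stand-in deriver: templated query, split at the word midpoint."""
    words = context.split()
    mid = len(words) // 2
    query = "What does this passage say about '" + " ".join(words[:3]) + "'?"
    return query, " ".join(words[:mid]), " ".join(words[mid:])


def build_tree(context: str, derive: Deriver, min_words: int = 4) -> Node:
    """Derive one query per node, then recurse on both halves of the context.

    Assumed stopping rule: a context is split only while each half can
    still hold roughly min_words words.
    """
    if min_words < 1:
        raise ValueError("min_words must be at least 1")
    query, left, right = derive(context)
    node = Node(context, query)
    if len(context.split()) >= 2 * min_words:
        for part in (left, right):
            if part.strip():
                node.children.append(build_tree(part, derive, min_words))
    return node


def collect_queries(node: Node) -> List[Tuple[str, str]]:
    """Flatten the tree into (query, context) pairs, coarse to fine."""
    pairs = [(node.query, node.context)]
    for child in node.children:
        pairs.extend(collect_queries(child))
    return pairs
```

Calling collect_queries(build_tree(text, toy_deriver)) yields one query for the whole text plus queries for progressively smaller spans, which is how a single document can contribute questions at several granularities.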

Findings

01. AugCon outperforms state-of-the-art methods in diversity and quality of generated data.

02. Extensive evaluations confirm high fidelity and applicability across multiple benchmarks.

03. The method is effective for both English and Chinese datasets.

Abstract

Constructing high-quality query-response pairs from a custom corpus is crucial for supervised fine-tuning (SFT) of large language models (LLMs) in many applications, such as creating domain-specific AI assistants or roleplaying agents. However, sourcing this data through human annotation is costly, and existing automated methods often fail to capture the diverse range of contextual granularity and tend to produce homogeneous data. To tackle these issues, we introduce a novel method named AugCon, capable of automatically generating context-driven SFT data across multiple levels of granularity with high diversity, quality, and fidelity. AugCon begins by generating queries using the Context-Split-Tree (CST), an innovative approach for recursively deriving queries and splitting context to cover full granularity. Then, we train a scorer through contrastive learning to collaborate with CST to rank and…
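The abstract says a scorer is trained through contrastive learning and used to rank queries, but the excerpt here is truncated before the objective is given. As an illustration only, the sketch below swaps in a common pairwise logistic (Bradley-Terry style) loss, which rewards scoring a preferred query above a less preferred one. The function names, the choice of loss, and the top_k selection are assumptions, not details taken from the paper.

```python
import math
from typing import List, Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_loss(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Mean of -log(sigmoid(pos - neg)) over paired scores.

    Assumed objective: each positive (preferred) query should receive a
    higher score than its paired negative query.
    """
    if len(pos_scores) != len(neg_scores) or not pos_scores:
        raise ValueError("need equally sized, non-empty score lists")
    total = sum(softplus(n - p) for p, n in zip(pos_scores, neg_scores))
    return total / len(pos_scores)


def rank_queries(queries: Sequence[str], scores: Sequence[float], top_k: int) -> List[str]:
    """Return the top_k queries ordered by descending scorer output."""
    if len(queries) != len(scores):
        raise ValueError("queries and scores must align")
    order = sorted(range(len(queries)), key=lambda i: scores[i], reverse=True)
    return [queries[i] for i in order[:top_k]]
```

The loss shrinks as the margin between a positive and its paired negative grows, so a trained scorer's outputs can be fed directly to rank_queries to keep only the highest-ranked candidates.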

Code & Models

Repositories

quanshr/augcon
PyTorch · Official

Models

quanshr/Qwen-DailyM-32B-LoRA (Hugging Face)
model · 1 download · 1 like

Videos

Automatically Generating Numerous Context-Driven SFT Data for LLMs Across Diverse Granularity

Taxonomy

Topics: Semantic Web and Ontologies · Scientific Computing and Data Management

Methods: Supervised Fine-Tuning · Contrastive Learning