A New Pipeline For Generating Instruction Dataset via RAG and Self Fine-Tuning

Chih-Wei Song; Yu-Kai Lee; Yin-Te Tsai

arXiv:2408.05911·cs.CL·August 13, 2024

TL;DR

This paper introduces a pipeline that uses retrieval-augmented generation and self fine-tuning to create high-quality, domain-specific instruction datasets from limited documents, enabling effective fine-tuning of large language models for specialized fields.

Contribution

The proposed pipeline automates domain-specific dataset creation using RAG and self fine-tuning, reducing manual effort and adapting quickly to domain updates, demonstrated in the psychiatry domain.

Findings

1. Successfully generated instruction datasets for psychiatry.

2. Fine-tuned LLMs showed improved domain relevance.

3. Pipeline adapts to document updates without retraining.

Abstract

With the rapid development of large language models (LLMs) in recent years, demand has grown for domain-specific agents that cater to the unique needs of enterprises and organizations. Unlike general models, which strive for broad coverage, these specialized agents rely on focused datasets tailored to their intended applications. This research proposes a pipeline that uses LLMs and a Retrieval-Augmented Generation (RAG) based framework to construct high-quality instruction datasets for domain-specific fine-tuning from custom document collections. By ingesting domain-specific documents, the pipeline generates relevant, contextually appropriate instructions, producing a comprehensive dataset for fine-tuning LLMs on the target domain. This approach overcomes the limitations of traditional dataset creation methods, which often rely on…
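The document-to-instruction flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ChunkIndex`, `build_record`, and `placeholder_generate` are hypothetical, retrieval uses a toy bag-of-words cosine similarity in place of a real embedding model, the LLM call is a stub, the two sample documents are made-up toy text, and the self fine-tuning stage is not covered.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split text into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_document(text, max_words=40):
    """Split a document into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ChunkIndex:
    """Toy in-memory retrieval index over document chunks (hypothetical helper)."""

    def __init__(self):
        self.chunks = []
        self.vectors = []

    def add_document(self, text, max_words=40):
        # New or updated documents are simply indexed; no model is retrained here.
        for chunk in chunk_document(text, max_words):
            self.chunks.append(chunk)
            self.vectors.append(Counter(tokenize(chunk)))

    def retrieve(self, query, k=2):
        """Return the k chunks most similar to the query."""
        q = Counter(tokenize(query))
        ranked = sorted(range(len(self.chunks)),
                        key=lambda i: cosine(q, self.vectors[i]),
                        reverse=True)
        return [self.chunks[i] for i in ranked[:k]]


def placeholder_generate(prompt):
    """Stand-in for an LLM call (assumption); returns a fixed marker string."""
    return "<model output would go here>"


def build_record(topic, index, generate, k=2):
    """Retrieve context for a topic and ask the generator for one instruction pair."""
    context = "\n".join(index.retrieve(topic, k))
    prompt = (f"Context:\n{context}\n\n"
              f"Write one question about '{topic}' and answer it using only the context.")
    return {"topic": topic, "context": context, "prompt": prompt, "output": generate(prompt)}


# Toy usage with made-up sample documents.
index = ChunkIndex()
index.add_document("Cognitive behavioral therapy is a structured, time-limited form of psychotherapy.")
index.add_document("Sleep hygiene refers to habits that support regular, restful sleep.")
record = build_record("cognitive behavioral therapy", index, placeholder_generate, k=1)
```

In this sketch, adding a document to the index is enough to make it available to later `build_record` calls, which is one plausible reading of how a retrieval-based pipeline could absorb document updates. Replacing `placeholder_generate` with a real model call would turn each record into an instruction-dataset entry.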

Taxonomy

Topics: Educational Technology and Assessment

Methods: Sparse Evolutionary Training