A New Pipeline For Generating Instruction Dataset via RAG and Self Fine-Tuning
Chih-Wei Song, Yu-Kai Lee, Yin-Te Tsai

TL;DR
This paper introduces a pipeline that uses retrieval-augmented generation and self fine-tuning to create high-quality, domain-specific instruction datasets from limited documents, enabling effective fine-tuning of large language models for specialized fields.
Contribution
The proposed pipeline automates domain-specific dataset creation using RAG and self fine-tuning, reducing manual effort and adapting quickly to domain updates, demonstrated in the psychiatry domain.
Findings
Successfully generated instruction datasets for psychiatry.
Fine-tuned LLMs showed improved domain relevance.
Pipeline adapts to document updates without retraining.
Abstract
With the rapid development of large language models (LLMs) in recent years, there has been an increasing demand for domain-specific agents that can cater to the unique needs of enterprises and organizations. Unlike general models, which strive for broad coverage, these specialized agents rely on focused datasets tailored to their intended applications. This research proposes a pipeline that uses LLMs together with a Retrieval-Augmented Generation (RAG) framework to construct high-quality instruction datasets for fine-tuning on specific domains from custom document collections. By ingesting domain-specific documents, the pipeline generates relevant and contextually appropriate instructions, creating a comprehensive dataset for fine-tuning LLMs on the target domain. This approach overcomes the limitations of traditional dataset creation methods, which often rely on…
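The document-to-dataset flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the chunk size, the bag-of-words retriever (a stand-in for a real embedding model and vector store), the prompt wording, the Alpaca-style record fields, and every function name here (`chunk_text`, `retrieve`, `build_dataset`, `stub_llm`) are assumptions made for illustration. The self fine-tuning stage is omitted entirely.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List


def chunk_text(text: str, max_words: int = 50) -> List[str]:
    """Split a document into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def _bag(text: str) -> Counter:
    """Lowercased bag-of-words vector (assumed stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = _bag(query)
    return sorted(chunks, key=lambda c: _cosine(q, _bag(c)), reverse=True)[:k]


def build_dataset(
    docs: List[str],
    topics: List[str],
    generate: Callable[[str], str],
    k: int = 2,
) -> List[Dict[str, str]]:
    """For each topic, retrieve context, then ask the model for a question and an answer."""
    chunks = [c for d in docs for c in chunk_text(d)]
    dataset = []
    for topic in topics:
        context = "\n".join(retrieve(topic, chunks, k))
        question = generate(f"Context:\n{context}\n\nWrite one question about: {topic}")
        answer = generate(f"Context:\n{context}\n\nAnswer the question: {question}")
        # Hypothetical record layout; the paper's actual schema may differ.
        dataset.append({"instruction": question, "input": "", "output": answer})
    return dataset


def stub_llm(prompt: str) -> str:
    """Placeholder for a real model call; simply echoes the last prompt line."""
    return prompt.strip().splitlines()[-1]
```

In practice `stub_llm` would be replaced by a call to whichever LLM is being used. Because the dataset is rebuilt from the current document collection, re-running `build_dataset` after the documents change is one plausible way to pick up domain updates.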
Taxonomy
Topics: Educational Technology and Assessment
Methods: Sparse Evolutionary Training
