FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only

He Zhu; Junyou Su; Tianle Lun; Yicheng Tao; Wenjia Zhang; Zipei Fan; Guanhua Chen

arXiv:2408.01323·cs.CL·August 5, 2024

Open Access

TL;DR

FANNO is an open-source framework that automates high-quality instruction data generation for LLMs, reducing costs and effort while achieving results comparable to human-annotated datasets.

Contribution

FANNO introduces a fully autonomous, open-source method for generating diverse instruction datasets using open LLMs, eliminating the need for manual annotation or proprietary API calls.

Findings

01

FANNO produces high-quality, diverse datasets comparable to human annotations.

02

FANNO significantly reduces annotation cost and effort by removing the need for manual annotation and proprietary API calls.

03

Experiments show that FANNO's data improves LLM performance on the Open LLM Leaderboard and AlpacaEval benchmarks.

Abstract

Instruction fine-tuning stands as a crucial advancement in leveraging large language models (LLMs) for enhanced task performance. However, the annotation of instruction datasets has traditionally been expensive and laborious, often relying on manual annotations or costly API calls to proprietary LLMs. To address these challenges, we introduce FANNO, a fully autonomous, open-sourced framework that revolutionizes the annotation process without the need for pre-existing annotated data. Utilizing a Mistral-7b-instruct model, FANNO efficiently produces diverse and high-quality datasets through a structured process involving document pre-screening, instruction generation, and response generation. Experiments on the Open LLM Leaderboard and AlpacaEval benchmarks show that FANNO can generate high-quality data with diversity and complexity for free, comparable to human-annotated or cleaned…
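The three stages named in the abstract (document pre-screening, instruction generation, response generation) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not FANNO's actual implementation: the abstract does not give the filtering criteria or prompts, so the length-and-duplicate filter, the prompt wording, and the names `prescreen`, `build_dataset`, and `generate` are all hypothetical. `generate` stands in for any function that sends a prompt to an open model such as Mistral-7b-instruct and returns its text.

```python
from typing import Callable, Dict, List

# Hypothetical type: any function that maps a prompt to generated text,
# for example a wrapper around a locally hosted open-source LLM.
Generate = Callable[[str], str]


def prescreen(documents: List[str], min_words: int = 20) -> List[str]:
    """Stage 1 (assumed criteria): drop short documents and exact duplicates."""
    seen = set()
    kept = []
    for doc in documents:
        text = " ".join(doc.split())  # normalize whitespace
        if len(text.split()) >= min_words and text not in seen:
            seen.add(text)
            kept.append(text)
    return kept


def build_dataset(
    documents: List[str], generate: Generate, min_words: int = 20
) -> List[Dict[str, str]]:
    """Run the three stages and return instruction/response pairs."""
    dataset = []
    for doc in prescreen(documents, min_words):
        # Stage 2: ask the model for an instruction grounded in the document.
        # The prompt wording is an assumption made for this sketch.
        instruction = generate(
            "Write one instruction that can be answered using this text:\n\n" + doc
        ).strip()
        if not instruction:
            continue  # skip documents for which the model produced nothing
        # Stage 3: ask the same model to answer the generated instruction.
        response = generate(
            "Respond to the following instruction.\n\n" + instruction
        ).strip()
        dataset.append({"instruction": instruction, "response": response})
    return dataset
```

Passing the model in as a plain callable keeps the sketch independent of any particular inference library, so the same code can be exercised with a stub function before a real model is attached.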

Taxonomy

Topics: Mathematics, Computing, and Information Processing