Harnessing the Power of David against Goliath: Exploring Instruction Data Generation without Using Closed-Source Models

Yue Wang; Xinrui Wang; Juntao Li; Jinxiong Chang; Qishen Zhang; Zhongyi Liu; Guannan Zhang; Min Zhang

arXiv:2308.12711·cs.CL·August 25, 2023

Open Access

TL;DR

This paper investigates methods to generate high-quality instruction data for training large language models without relying on closed-source models, proposing novel strategies and demonstrating superior performance over existing approaches like Alpaca.

Contribution

The paper introduces new strategies for instruction data generation that do not depend on closed-source models and shows they outperform existing methods such as Alpaca.

Findings

01 Generated data outperforms Alpaca on benchmarks

02 Proposed strategies enhance instruction data quality

03 Effective alternative to closed-source model reliance

Abstract

Instruction tuning is instrumental in enabling Large Language Models (LLMs) to follow user instructions to complete various open-domain tasks. The success of instruction tuning depends on the availability of high-quality instruction data. Owing to the exorbitant cost and substandard quality of human annotation, recent works have been deeply engaged in the exploration of the utilization of powerful closed-source models to generate instruction data automatically. However, these methods carry potential risks arising from the usage requirements of powerful closed-source models, which strictly forbid the utilization of their outputs to develop machine learning models. To deal with this problem, in this work, we explore alternative approaches to generate high-quality instruction data that do not rely on closed-source models. Our exploration includes an investigation of various existing…

Taxonomy

Topics: Topic Modeling · Machine Learning and Data Classification · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Byte Pair Encoding · Adam · Position-Wise Feed-Forward Layer · Layer Normalization · Absolute Position Encodings · Residual Connection