Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei

arXiv:2402.13064·cs.CL·February 21, 2024·5 cites



TL;DR

This paper presents GLAN, a scalable method for instruction tuning of large language models using a human knowledge taxonomy to generate synthetic training data, improving performance across diverse tasks without task-specific data.

Contribution

Introduces a novel generalized instruction tuning framework that leverages a structured human knowledge taxonomy to generate synthetic data for large language models.

Findings

01. GLAN improves performance on mathematical reasoning, coding, and academic exams.

02. It enables easy customization for new fields or skills.

03. It is effective without task-specific training data.

Abstract

We introduce Generalized Instruction Tuning (called GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs). Unlike prior work that relies on seed examples or existing datasets to construct instruction tuning data, GLAN exclusively utilizes a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines. Specifically, inspired by the systematic structure of the human education system, we build the taxonomy by decomposing human knowledge and capabilities into various fields, sub-fields and, ultimately, distinct disciplines semi-automatically, facilitated by LLMs. Subsequently, we generate a comprehensive list of subjects for every discipline and proceed to design a syllabus tailored to each subject, again utilizing LLMs. With the fine-grained key concepts detailed in every class…
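The stages the abstract names (taxonomy of fields, sub-fields and disciplines, then subjects per discipline, then a syllabus per subject) can be pictured with a minimal Python sketch. It is an illustration only, not the authors' implementation: `toy_llm`, `expand_discipline`, `build_curriculum`, the prompt strings, and the toy taxonomy are all hypothetical, and the LLM is replaced by a deterministic stub. The sketch stops at the syllabus stage because the abstract shown here is cut off after that point.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns a list of strings.
LLM = Callable[[str], List[str]]


@dataclass
class Subject:
    name: str
    syllabus: List[str] = field(default_factory=list)


@dataclass
class Discipline:
    name: str
    subjects: List[Subject] = field(default_factory=list)


def expand_discipline(discipline: str, llm: LLM) -> Discipline:
    """Ask the (stubbed) LLM for subjects, then for a syllabus per subject."""
    subjects = []
    for subject_name in llm(f"List subjects taught in the discipline: {discipline}"):
        syllabus = llm(f"Design a syllabus for the subject: {subject_name}")
        subjects.append(Subject(subject_name, syllabus))
    return Discipline(discipline, subjects)


def build_curriculum(
    taxonomy: Dict[str, Dict[str, List[str]]], llm: LLM
) -> List[Discipline]:
    """Walk field -> sub-field -> discipline and expand every discipline."""
    return [
        expand_discipline(discipline, llm)
        for sub_fields in taxonomy.values()
        for disciplines in sub_fields.values()
        for discipline in disciplines
    ]


def toy_llm(prompt: str) -> List[str]:
    """Deterministic stub (an assumption) so the sketch runs without a model."""
    topic = prompt.split(": ", 1)[1]
    if prompt.startswith("List subjects"):
        return [f"Introductory {topic}", f"Advanced {topic}"]
    return [f"{topic}: topic 1", f"{topic}: topic 2"]


# Toy taxonomy (invented for illustration): field -> sub-field -> disciplines.
toy_taxonomy = {"Natural Sciences": {"Mathematics": ["Algebra", "Geometry"]}}
curriculum = build_curriculum(toy_taxonomy, toy_llm)
```

Because the taxonomy is the only input, adding a new field or skill in this sketch amounts to adding one entry to `toy_taxonomy` and rerunning `build_curriculum`, which mirrors the customization claim in the findings.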


Code & Models

Models

xTRam1/safe-guard-classifier
model · 29 dl


Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis