Improving Natural Language Understanding for LLMs via Large-Scale Instruction Synthesis

Lin Yuan; Jun Xu; Honghao Gui; Mengshu Sun; Zhiqiang Zhang; Lei Liang; Jun Zhou

arXiv:2502.03843 · cs.CL · February 7, 2025

TL;DR

This paper introduces Hum, a large-scale synthetic instruction dataset for NLU tasks, created through human-LLM collaboration, significantly improving LLMs' NLU performance without harming general capabilities.

Contribution

The paper presents Hum, a diverse, high-quality instruction corpus for NLU, and a collaborative synthesis method to enhance LLMs' NLU abilities.

Findings

1. Hum improves NLU performance by 3.1% on average
2. Diverse instruction data maintains overall model capabilities
3. Collaborative synthesis enriches instruction quality and variety

Abstract

High-quality, large-scale instructions are crucial for aligning large language models (LLMs); however, there is a severe shortage of instructions in the field of natural language understanding (NLU). Previous work on constructing NLU instructions mainly focuses on information extraction (IE), neglecting tasks such as machine reading comprehension, question answering, and text classification. Furthermore, the lack of diversity in the data has reduced the generalization ability of trained LLMs on other NLU tasks and caused a noticeable decline in the base model's general capabilities. To address this issue, we propose Hum, a large-scale, high-quality synthetic instruction corpus for NLU tasks, designed to enhance the NLU capabilities of LLMs. Specifically, Hum includes IE (either closed IE or open IE), machine reading comprehension, text classification, and instruction generalist…

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Intelligent Tutoring Systems and Adaptive Learning