Improving Natural Language Understanding for LLMs via Large-Scale Instruction Synthesis
Lin Yuan, Jun Xu, Honghao Gui, Mengshu Sun, Zhiqiang Zhang, Lei Liang, Jun Zhou

TL;DR
This paper introduces Hum, a large-scale synthetic instruction dataset for NLU tasks, created through human-LLM collaboration, significantly improving LLMs' NLU performance without harming general capabilities.
Contribution
The paper presents Hum, a diverse, high-quality instruction corpus for NLU, and a collaborative synthesis method to enhance LLMs' NLU abilities.
Findings
Hum improves NLU performance by 3.1% on average
Diverse instruction data maintains overall model capabilities
Collaborative synthesis enriches instruction quality and variety
Abstract
High-quality, large-scale instructions are crucial for aligning large language models (LLMs); however, there is a severe shortage of instructions in the field of natural language understanding (NLU). Previous works on constructing NLU instructions mainly focus on information extraction (IE), neglecting tasks such as machine reading comprehension, question answering, and text classification. Furthermore, the lack of diversity in the data has led to a decreased generalization ability of trained LLMs in other NLU tasks and a noticeable decline in the fundamental model's general capabilities. To address this issue, we propose Hum, a large-scale, high-quality synthetic instruction corpus for NLU tasks, designed to enhance the NLU capabilities of LLMs. Specifically, Hum includes IE (either closed IE or open IE), machine reading comprehension, text classification, and instruction generalist…
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Intelligent Tutoring Systems and Adaptive Learning
