CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions

Jun Rao; Xuebo Liu; Lian Lian; Shengjun Cheng; Yunjie Liao; Min Zhang

arXiv:2410.03077·cs.CL·October 7, 2024

TL;DR

CommonIT introduces a novel instruction tuning method that clusters datasets into groups based on task, embedding, and length, improving large language models' ability to follow instructions across domains.

Contribution

The paper proposes a new data sampling strategy for instruction tuning that enhances LLM performance by focusing on data similarity within training batches.

Findings

01. Improves instruction-following accuracy by 2.1% on general tasks.

02. Boosts performance by 5.2% on specialized domains.

03. Enhances task-specific capabilities across multiple models.

Abstract

With instruction tuning, Large Language Models (LLMs) can enhance their ability to adhere to commands. Diverging from most works focusing on data mixing, our study concentrates on enhancing the model's capabilities from the perspective of data sampling during training. Drawing inspiration from the human learning process, where it is generally easier to master solutions to similar topics through focused practice on a single type of topic, we introduce a novel instruction tuning strategy termed CommonIT: Commonality-aware Instruction Tuning. Specifically, we cluster instruction datasets into distinct groups with three proposed metrics (Task, Embedding and Length). We ensure each training mini-batch, or "partition", consists solely of data from a single group, which brings about both data randomness across mini-batches and intra-batch data similarity. Rigorous testing on LLaMA models…
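The batching idea in the abstract (group the data first, then draw every mini-batch from a single group while keeping the order of batches random) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `length_group` and `single_group_batches`, the `instruction` and `task` field names, and the length boundaries are all hypothetical choices made for illustration, and the whitespace token count is only a stand-in for whatever the paper's Length metric actually measures.

```python
import random
from collections import defaultdict


def length_group(example, boundaries=(64, 256)):
    """Hypothetical Length grouping: bucket by whitespace token count.

    The boundaries are illustrative assumptions, not values from the paper.
    """
    n_tokens = len(example["instruction"].split())
    for bucket, upper in enumerate(boundaries):
        if n_tokens <= upper:
            return bucket
    return len(boundaries)


def single_group_batches(examples, group_fn, batch_size, seed=0):
    """Return lists of example indices where each list draws from one group.

    Examples are shuffled inside each group, cut into batches, and the
    batches from all groups are then shuffled together, so batch order is
    random while each batch stays homogeneous.
    """
    rng = random.Random(seed)

    # Partition example indices by the group key (task, length bucket, ...).
    groups = defaultdict(list)
    for idx, example in enumerate(examples):
        groups[group_fn(example)].append(idx)

    batches = []
    for indices in groups.values():
        rng.shuffle(indices)
        # The last batch of a group may be smaller than batch_size.
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])

    # Randomize the order in which groups are visited during training.
    rng.shuffle(batches)
    return batches


if __name__ == "__main__":
    # Toy data with an assumed "task" label per example.
    data = [
        {"instruction": "Solve 2 + 2", "task": "math"},
        {"instruction": "Write a sort function", "task": "code"},
        {"instruction": "Solve 3 * 7", "task": "math"},
        {"instruction": "Reverse a string", "task": "code"},
        {"instruction": "Solve 10 / 5", "task": "math"},
    ]
    for batch in single_group_batches(data, lambda ex: ex["task"], batch_size=2):
        print([data[i]["task"] for i in batch])
```

Passing `length_group` as `group_fn` switches the same sampler from task-based to length-based grouping. An embedding-based grouping would need a clustering step to produce the group key, which is omitted here.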

Code & Models

Repositories

raojay7/commonit
PyTorch · Official

Videos

CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions · Underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: LLaMA · BLOOM