In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models

Ayrton San Joaquin; Bin Wang; Zhengyuan Liu; Nicholas Asher; Brian Lim; Philippe Muller; Nancy F. Chen

arXiv:2408.03560·cs.LG·October 4, 2024

Open Access

TL;DR

In2Core introduces an influence-function-based coreset selection method that halves the data needed to fine-tune large language models while achieving similar performance, and that gives interpretable signals about how well the training set covers test samples.

Contribution

The paper presents a novel influence function-based algorithm for efficient coreset selection in instruction fine-tuning of LLMs, reducing data needs while preserving accuracy.

Findings

01 Achieves similar performance with 50% of training data

02 Provides interpretable signals on training set coverage of test samples

03 Reduces influence computation to fewer layers without loss of accuracy

Abstract

Despite advancements, fine-tuning Large Language Models (LLMs) remains costly due to the extensive parameter count and substantial data requirements for model generalization. Access to computing resources remains a barrier for the open-source community. To address this challenge, we propose the In2Core algorithm, which selects a coreset by analyzing the correlation between training and evaluation samples with a trained model. Notably, we assess the model's internal gradients to estimate this relationship, aiming to rank the contribution of each training point. To enhance efficiency, we propose an optimization to compute influence functions with a reduced number of layers while achieving similar accuracy. By applying our algorithm to instruction fine-tuning data of LLMs, we can achieve similar performance with just 50% of the training data. Meanwhile, using influence functions to…
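The ranking idea in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the In2Core implementation: it assumes a linear least-squares model in place of an LLM, uses the classic influence-function form (negative evaluation gradient times inverse Hessian times training gradient), and keeps the half of the training set whose upweighting would most reduce evaluation loss. The function names, the `damping` term, and the choice of which end of the ranking to keep are all assumptions made for illustration.

```python
import numpy as np

def per_sample_grads(X, y, w):
    """Gradient of 0.5 * (x @ w - y)**2 with respect to w, one row per sample."""
    residual = X @ w - y
    return residual[:, None] * X

def influence_scores(X_train, y_train, X_eval, y_eval, w, damping=1e-3):
    """Influence of each training point on the summed evaluation loss.

    Hypothetical helper: score_i = -g_eval^T H^{-1} g_train_i, where H is the
    (damped) Hessian of the mean training loss. A negative score means that
    upweighting the point is estimated to lower the evaluation loss.
    """
    n, d = X_train.shape
    hessian = X_train.T @ X_train / n + damping * np.eye(d)
    g_train = per_sample_grads(X_train, y_train, w)
    g_eval = per_sample_grads(X_eval, y_eval, w).sum(axis=0)
    return -g_train @ np.linalg.solve(hessian, g_eval)

def select_coreset(scores, fraction=0.5):
    """Keep the given fraction of points with the lowest (most helpful) scores.

    Which end of the ranking to keep is an assumption of this sketch.
    """
    k = int(len(scores) * fraction)
    return np.argsort(scores)[:k]

# Synthetic data, purely illustrative.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X_train = rng.normal(size=(200, 5))
y_train = X_train @ w_true + 0.1 * rng.normal(size=200)
X_eval = rng.normal(size=(20, 5))
y_eval = X_eval @ w_true + 0.1 * rng.normal(size=20)

# "Trained model": the least-squares fit on the full training set.
w = np.linalg.lstsq(X_train, y_train, rcond=None)[0]

scores = influence_scores(X_train, y_train, X_eval, y_eval, w)
coreset = select_coreset(scores, fraction=0.5)
```

In an LLM setting the Hessian cannot be formed explicitly and the gradients would be taken over model parameters; restricting them to a subset of layers, as the abstract describes, would shrink the vectors entering this product. The toy model here has a single layer, so that optimization is not shown.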

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis