Automatic Pruning of Fine-tuning Datasets for Transformer-based Language Models

Mohammadreza Tayaranian; Seyyed Hasan Mozafari; Brett H. Meyer; James J. Clark; Warren J. Gross

arXiv:2407.08887·cs.CL·July 15, 2024

Open Access · 1 Repo

TL;DR

This paper introduces an automatic dataset pruning method for fine-tuning transformer-based language models, creating smaller, task-specific training subsets that maintain or slightly improve evaluation performance.

Contribution

It proposes a novel automatic pruning approach based on model success rates, generating optimized subsets for fine-tuning that outperform previous manual or feedback-based methods.

Findings

01. Winning ticket subsets are, on average, 3 times smaller than the original datasets.

02. Fine-tuning on these subsets yields a 0.1% increase in evaluation accuracy.

03. The method is effective across multiple tasks and models.

Abstract

Transformer-based language models have shown state-of-the-art performance on a variety of natural language understanding tasks. To achieve this performance, these models are first pre-trained on a general corpus and then fine-tuned on downstream tasks. Previous work studied the effect of pruning the training set of the downstream tasks on the performance of the model on its evaluation set. In this work, we propose an automatic dataset pruning method for the training set of fine-tuning tasks. Our method is based on the model's success rate in correctly classifying each training data point. Unlike previous work, which relies on user feedback to determine subset size, our method automatically extracts training subsets that are adapted for each pair of model and fine-tuning task. Our method provides multiple subsets for use in dataset pruning that navigate the trade-off between subset size and…
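
The abstract's central idea, scoring each training example by how often the model classifies it correctly, can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the abstract is truncated and does not say how the success rate is measured or which examples are kept, so the toy `correct` matrix (one row per fine-tuning run), the function names `success_rates` and `prune`, and the `low`/`high` thresholds are hypothetical and are not taken from the paper or its repository.

```python
from typing import List, Sequence


def success_rates(correct: Sequence[Sequence[bool]]) -> List[float]:
    """Fraction of runs that classified each training example correctly.

    correct[r][i] is True if run r got training example i right.
    (Assumed layout; the paper's bookkeeping may differ.)
    """
    n_runs = len(correct)
    n_examples = len(correct[0])
    return [sum(run[i] for run in correct) / n_runs for i in range(n_examples)]


def prune(rates: Sequence[float], low: float, high: float) -> List[int]:
    """Indices of examples whose success rate lies in [low, high].

    The thresholds are placeholders; the source does not specify
    which band of success rates forms a pruned subset.
    """
    return [i for i, r in enumerate(rates) if low <= r <= high]


# Toy data: 4 hypothetical fine-tuning runs over 6 training examples.
correct = [
    [True, True, False, False, True, False],
    [True, False, False, True, True, False],
    [True, True, False, False, True, True],
    [True, False, False, True, False, False],
]

rates = success_rates(correct)   # [1.0, 0.5, 0.0, 0.5, 0.75, 0.25]
kept = prune(rates, 0.25, 0.75)  # [1, 3, 4, 5]
```

Varying `low` and `high` produces subsets of different sizes, which is one simple way to picture the size-versus-performance trade-off the abstract mentions.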

Code & Models

Repositories

mthcom/hscore-dataset-pruning
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Dataset Pruning · Sparse Evolutionary Training · Pruning