Automatic Pruning of Fine-tuning Datasets for Transformer-based Language Models
Mohammadreza Tayaranian, Seyyed Hasan Mozafari, Brett H. Meyer, James J. Clark, Warren J. Gross

TL;DR
This paper introduces an automatic dataset pruning method for fine-tuning transformer-based language models, creating smaller, task-specific training subsets that maintain or slightly improve evaluation performance.
Contribution
It proposes a novel automatic pruning approach based on model success rates, generating optimized subsets for fine-tuning that outperform previous manual or feedback-based methods.
Findings
Winning ticket subsets are on average 3 times smaller than original datasets.
Fine-tuning on these subsets yields a 0.1% increase in evaluation accuracy.
The method is effective across multiple tasks and models.
Abstract
Transformer-based language models have shown state-of-the-art performance on a variety of natural language understanding tasks. To achieve this performance, these models are first pre-trained on a general corpus and then fine-tuned on downstream tasks. Previous work studied the effect of pruning the training set of the downstream tasks on the performance of the model on its evaluation set. In this work, we propose an automatic dataset pruning method for the training set of fine-tuning tasks. Our method is based on the model's success rate in correctly classifying each training data point. Unlike previous work, which relies on user feedback to determine subset size, our method automatically extracts training subsets that are adapted for each pair of model and fine-tuning task. Our method provides multiple subsets for use in dataset pruning that navigate the trade-off between subset size and…
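The abstract's central quantity, the model's success rate on each training data point, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the idea of recording correctness over several evaluation rounds, and the threshold band used to select a subset are all assumptions made for illustration. The abstract does not specify how the subsets are actually extracted.

```python
def success_rates(correct):
    """Per-example success rate.

    `correct` is a list of rounds; each round is a list of booleans saying
    whether the model classified that training example correctly.
    (Assumption: correctness is recorded over several rounds, e.g. epochs or seeds.)
    """
    num_rounds = len(correct)
    # zip(*correct) regroups the records by example instead of by round.
    return [sum(per_example) / num_rounds for per_example in zip(*correct)]


def subset_indices(rates, low, high):
    """Indices of examples whose success rate lies in [low, high].

    The band [low, high] is a hypothetical selection rule, not the paper's.
    """
    return [i for i, r in enumerate(rates) if low <= r <= high]


# Toy record: 4 rounds x 4 training examples.
correct = [
    [True, False, True, False],
    [True, False, False, True],
    [True, False, True, True],
    [True, False, True, True],
]

rates = success_rates(correct)
kept = subset_indices(rates, 0.25, 1.0)
print(rates)  # [1.0, 0.0, 0.75, 0.75]
print(kept)   # [0, 2, 3]
```

Varying the band gives subsets of different sizes, which is one simple way to picture the size trade-off the abstract mentions.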
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Dataset Pruning · Sparse Evolutionary Training · Pruning
