Swift Cross-Dataset Pruning: Enhancing Fine-Tuning Efficiency in Natural Language Understanding
Binh-Nguyen Nguyen, Yang He

TL;DR
This paper introduces Swift Cross-Dataset Pruning (SCDP), a fast method for selecting representative data subsets that lowers the computational cost of fine-tuning on NLP tasks across diverse datasets.
Contribution
The paper proposes a pruning technique that uses TF-IDF embeddings and the geometric median to select data efficiently for cross-dataset NLP fine-tuning, addressing both the variability across datasets and the computational cost of ranking samples.
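As a rough illustration of the two ingredients named above, the sketch below builds TF-IDF vectors for a small corpus, approximates their geometric median, and reports each sample's distance to it. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (tfidf_vectors, geometric_median, distances_to_median) are hypothetical, whitespace tokenization and smoothed IDF are choices made here for brevity, and Weiszfeld's algorithm is one common solver for the geometric median (the summary does not say which one the paper uses). How the distances are turned into a keep-or-drop decision is left to the caller.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Dense, L2-normalized TF-IDF vectors (assumed: whitespace tokens, smoothed IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        vec = [0.0] * len(vocab)
        for tok, count in Counter(toks).items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0
            vec[index[tok]] = (count / len(toks)) * idf
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def geometric_median(points, n_iter=200, tol=1e-9):
    """Weiszfeld's algorithm: the point minimizing the summed Euclidean distance."""
    dim = len(points[0])
    # Start from the arithmetic mean.
    median = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(n_iter):
        # Inverse-distance weights; the clamp avoids division by zero.
        weights = [1.0 / max(math.dist(p, median), 1e-12) for p in points]
        total = sum(weights)
        updated = [
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)
        ]
        shift = math.dist(updated, median)
        median = updated
        if shift < tol:
            break
    return median


def distances_to_median(docs):
    """Distance of each document's TF-IDF vector to the corpus geometric median."""
    vectors = tfidf_vectors(docs)
    median = geometric_median(vectors)
    return [math.dist(v, median) for v in vectors]


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    corpus = ["the cat sat", "the cat sat", "the cat sat", "quantum flux capacitor"]
    for text, dist in zip(corpus, distances_to_median(corpus)):
        print(f"{dist:.4f}  {text}")
```

Because the geometric median minimizes a sum of distances rather than squared distances, it is less sensitive to outlying samples than the mean, which is why distance to it can serve as a cheap importance score without training a model.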
Findings
Significantly reduces training data and computational resources.
Effective across multiple NLP tasks and dataset scales.
Maintains or improves model performance with fewer samples.
Abstract
Dataset pruning aims to select a subset of a dataset for efficient model training. While data efficiency in natural language processing has primarily focused on within-corpus scenarios during model pre-training, efficient dataset pruning for task-specific fine-tuning across diverse datasets remains challenging due to variability in dataset sizes, data distributions, class imbalance and label spaces. Current cross-dataset pruning techniques for fine-tuning often rely on computationally expensive sample ranking processes, typically requiring full dataset training or reference models. We address this gap by proposing Swift Cross-Dataset Pruning (SCDP). Specifically, our approach uses TF-IDF embeddings with geometric median to rapidly evaluate sample importance. We then apply dataset size-adaptive pruning to ensure diversity: for smaller datasets, we retain samples far from the geometric…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Dataset Pruning · Pruning
