CRAFT: Clustered Regression for Adaptive Filtering of Training data
Parthasarathi Panda, Asheswari Swain, Subhrakanta Panda

TL;DR
CRAFT is a fast, vectorization-agnostic method for selecting high-quality training data from large corpora for sequence-to-sequence models, improving translation performance at low selection cost.
Contribution
The paper introduces a novel two-stage, clustering-based selection method with theoretical bounds that significantly speeds up data filtering for large-scale training datasets.
Findings
CRAFT reaches 43.34 BLEU, outperforming TSDS at 41.21 BLEU.
CRAFT completes selection over 40 times faster than comparable methods.
The pipeline runs in under one minute on CPU with TF-IDF vectorization.
Abstract
Selecting a small, high-quality subset from a large corpus for fine-tuning is increasingly important as corpora grow to tens of millions of datapoints, making full fine-tuning expensive and often unnecessary. We propose CRAFT (Clustered Regression for Adaptive Filtering of Training data), a vectorization-agnostic selection method for training sequence-to-sequence models. CRAFT decomposes the joint source-target distribution and performs a two-stage selection: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. We prove that proportional cluster allocation bounds the continuous KL divergence between selected and validation distributions, with the residual…
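The two-stage selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the vectorization and k-means clustering have already been run, so each source sentence arrives with a cluster label, and it uses the mean Euclidean distance to the validation targets in the same cluster as a stand-in for the paper's conditional expected distance. The function names (`allocate_budget`, `select_in_cluster`, `craft_like_select`) and the largest-remainder rounding are hypothetical choices made for illustration.

```python
from collections import Counter
from math import dist, floor


def allocate_budget(val_clusters, train_clusters, budget):
    """Stage (i): split `budget` across source clusters in proportion to
    validation counts, rounding with the largest-remainder rule (an assumption)."""
    val_counts = Counter(val_clusters)
    available = Counter(train_clusters)  # training pairs available per cluster
    total = sum(val_counts.values())
    quotas = {c: budget * n / total for c, n in val_counts.items()}
    alloc = {c: min(floor(q), available[c]) for c, q in quotas.items()}
    leftover = budget - sum(alloc.values())
    # Hand out remaining slots by largest fractional part, respecting capacity.
    order = sorted(quotas, key=lambda c: quotas[c] - floor(quotas[c]), reverse=True)
    while leftover > 0:
        progressed = False
        for c in order:
            if leftover == 0:
                break
            if alloc[c] < available[c]:
                alloc[c] += 1
                leftover -= 1
                progressed = True
        if not progressed:  # every cluster is exhausted
            break
    return alloc


def select_in_cluster(train_idx, train_targets, val_targets, k):
    """Stage (ii): keep the k training pairs whose target vectors have the
    lowest mean distance to the validation targets of this cluster.
    Mean Euclidean distance is a stand-in, not the paper's exact criterion."""
    def score(i):
        return sum(dist(train_targets[i], v) for v in val_targets) / len(val_targets)
    return sorted(train_idx, key=score)[:k]


def craft_like_select(train_clusters, train_targets, val_clusters, val_targets, budget):
    """Run both stages and return the sorted indices of selected training pairs."""
    alloc = allocate_budget(val_clusters, train_clusters, budget)
    selected = []
    for c, k in alloc.items():
        t_idx = [i for i, tc in enumerate(train_clusters) if tc == c]
        v_tgt = [val_targets[j] for j, vc in enumerate(val_clusters) if vc == c]
        selected.extend(select_in_cluster(t_idx, train_targets, v_tgt, k))
    return sorted(selected)


# Toy usage: cluster labels and 1-D "target embeddings" are made up.
train_clusters = [0, 0, 0, 0, 1, 1]
train_targets = [(0.0,), (1.0,), (5.0,), (0.5,), (2.0,), (9.0,)]
val_clusters = [0, 0, 0, 1]
val_targets = [(0.0,), (0.0,), (1.0,), (2.0,)]
picked = craft_like_select(train_clusters, train_targets, val_clusters, val_targets, budget=4)
print(picked)
```

Because the sketch takes cluster labels and target vectors as plain inputs, any vectorizer (TF-IDF or a neural encoder) could sit upstream, which mirrors the vectorization-agnostic framing of the abstract.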
