ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment
Elyas Obbad, Iddah Mlauzi, Brando Miranda, Rylan Schaeffer, Kamal Obbad, Suhana Bedi, Sanmi Koyejo

TL;DR
ZIP-FIT introduces a compression-based data selection method that effectively aligns training data with target tasks, leading to faster and more efficient language model training for specialized tasks like Autoformalization and code generation.
Contribution
The paper presents ZIP-FIT, a novel data selection framework using gzip compression to measure task alignment, outperforming existing methods in efficiency and effectiveness.
Findings
ZIP-FIT achieves up to 85.1% faster training convergence.
ZIP-FIT selection is up to 65.8% faster than DSIR.
Small, well-aligned datasets outperform larger, less targeted ones.
Abstract
Data selection is crucial for optimizing language model (LM) performance on specific tasks, yet most existing methods fail to effectively consider the target task distribution. Current approaches either ignore task-specific requirements entirely or rely on approximations that fail to capture the nuanced patterns needed for tasks like Autoformalization or code generation. Methods that do consider the target distribution often rely on simplistic, sometimes noisy, representations, like hashed n-gram features, which can lead to collisions and introduce noise. We introduce ZIP-FIT, a data selection framework that uses gzip compression to directly measure alignment between potential training data and the target task distribution. In extensive evaluations on Autoformalization and Python code generation, ZIP-FIT significantly outperforms leading baselines like DSIR and D4. Models…
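The compression-based alignment idea described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes alignment is defined as 1 minus the normalized compression distance (NCD), averaged over the target examples, and that the top-k candidates by that score are kept. The function names `compressed_size`, `ncd`, `alignment`, and `select_top_k` are hypothetical.

```python
import gzip
from typing import List, Sequence


def compressed_size(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings using gzip."""
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def alignment(candidate: str, targets: Sequence[str]) -> float:
    """Assumed score: mean of (1 - NCD) between a candidate and each target."""
    return sum(1.0 - ncd(candidate, t) for t in targets) / len(targets)


def select_top_k(candidates: Sequence[str], targets: Sequence[str], k: int) -> List[str]:
    """Return the k candidates with the highest assumed alignment score."""
    ranked = sorted(candidates, key=lambda c: alignment(c, targets), reverse=True)
    return ranked[:k]


# Toy data (hypothetical): one code-like target, two candidates.
target_examples = ["def add(a, b):\n    return a + b\n"]
code_candidate = "def sub(a, b):\n    return a - b\n"
prose_candidate = "The quick brown fox jumps over the lazy dog near the river bank."

for cand in (code_candidate, prose_candidate):
    print(round(alignment(cand, target_examples), 3), repr(cand[:30]))
```

On very short strings the fixed gzip header inflates every compressed size, so the scores are noisy. The toy data only shows the direction of the effect: text that shares substrings with the target compresses better when concatenated with it.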
Peer Reviews
Decision: Submitted to ICLR 2025
The paper is concise, sound, well written, and the experimental section shows promise for the method, especially with regard to other embedding-free methods. The conceptual simplicity combined with the empirical results of the method is an especially strong point of the work.
Ideally, it would be shown how the size of $n$ (i.e., number of samples from the target domain $p$) influences the performance of the method. If it is possible to pick $n$ just large enough, it would greatly improve the computational efficiency of the method for large target datasets. Experiments in other domains would be really nice to better demonstrate the generalization capabilities of the method. Possibly there is data that is not well-suited to compression and accordingly ZIP
1. ZIP-FIT’s embedding-free approach is a refreshing deviation from common embedding-based methods, offering a novel solution by leveraging gzip compression. The concept of using normalized compression distance (NCD) as an alignment metric is insightful and could inspire future research in embedding-free methodologies for various data selection tasks.
2. The empirical results support the claims, showing that ZIP-FIT achieves faster convergence and better performance than established methods. The
1. While ZIP-FIT achieves excellent results on the tasks tested, its reliance on gzip compression may limit its effectiveness in complex semantic domains where relationships are nuanced and less compressible. Embedding-free approaches, while efficient, may not be ideal for tasks that require deep semantic understanding or complex syntactic relationships.
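The concern about less compressible data can be made concrete with a small sketch. It is an illustration only, not an experiment from the paper: seeded pseudo-random bytes stand in for data that gzip cannot compress, and the helper `ncd_bytes` is a hypothetical name. When neither input compresses, the concatenation costs roughly the sum of both parts, so the NCD approaches 1 and carries little ranking signal.

```python
import gzip
import random


def ncd_bytes(a: bytes, b: bytes) -> float:
    """Normalized compression distance between two byte strings using gzip."""
    ca, cb = len(gzip.compress(a)), len(gzip.compress(b))
    cab = len(gzip.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)


# Seeded pseudo-random bytes: a stand-in for incompressible data.
rng = random.Random(0)
noise_a = bytes(rng.getrandbits(8) for _ in range(2000))
noise_b = bytes(rng.getrandbits(8) for _ in range(2000))

# Highly repetitive code-like bytes: a stand-in for compressible data.
repetitive = b"def f(x):\n    return x + 1\n" * 50

print("noise vs noise:          ", round(ncd_bytes(noise_a, noise_b), 3))
print("repetitive vs repetitive:", round(ncd_bytes(repetitive, repetitive), 3))
```

Natural-language or code corpora sit between these two extremes, so how much the effect matters in practice is an empirical question that the review leaves open.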
- This paper is well-presented and well-motivated.
- Studying computation-efficient methods for data selection in LLM instruction fine-tuning is a promising research direction.
- The proposed ZIP-FIT is intuitive and easy to follow.
- The proposed approach bypasses the need for LLM forward computation to obtain embeddings, making it computationally efficient.
- The presented experimental results seem promising.
- [Major] The proposed method seems very simple and straightforward; using a gzip-style method to embed data appears to be a relatively standard approach.
- [Major] All experimental results are based on test loss, which may not be very reliable. It would be essential to conduct evaluations on some standard benchmarks, such as HumanEval and MBPP for code evaluation, to demonstrate the scores the model can achieve.
- It is unclear how the proposed ZIP-FIT compares to prior, more complex data selec
Taxonomy
Topics: Algorithms and Data Compression · Neural Networks and Applications
