Data Pruning by Information Maximization
Haoru Tan, Sitong Wu, Wei Huang, Shizhen Zhao, Xiaojuan Qi

TL;DR
This paper introduces InfoMax, a scalable data pruning method that selects informative samples by maximizing information content and minimizing redundancy, improving model training efficiency across large datasets.
Contribution
The paper proposes a novel coreset selection algorithm, InfoMax, formalized as a discrete quadratic programming problem with an efficient gradient-based solver for large-scale data pruning.
Findings
Outperforms existing data pruning methods in various tasks
Effective on datasets with millions of samples
Enhances model training efficiency and performance
Abstract
In this paper, we present InfoMax, a novel data pruning method, also known as coreset selection, designed to maximize the information content of selected samples while minimizing redundancy. By doing so, InfoMax enhances the overall informativeness of the coreset. The information of individual samples is measured by importance scores, which capture their influence or difficulty in model learning. To quantify redundancy, we use pairwise sample similarities, based on the premise that similar samples contribute similarly to the learning process. We formalize the coreset selection problem as a discrete quadratic programming (DQP) task, with the objective of maximizing the total information content, represented as the sum of individual sample contributions minus the redundancies introduced by similar samples within the coreset. To ensure practical scalability, we introduce an efficient…
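The objective described in the abstract (total importance of the selected samples minus the redundancy among them) can be illustrated with a small sketch. This is not the paper's solver: the abstract is cut off before the solver is described, so a simple greedy heuristic is swapped in here. The names `infomax_objective` and `greedy_select`, the trade-off weight `alpha`, and the assumption of a symmetric similarity matrix with a zero diagonal are all hypothetical choices made for illustration.

```python
import numpy as np

def infomax_objective(x, scores, sim, alpha=1.0):
    """Sum of selected importance scores minus a pairwise-redundancy penalty.

    x      : 0/1 indicator vector marking the selected samples
    scores : per-sample importance scores
    sim    : symmetric pairwise similarity matrix (zero diagonal assumed)
    alpha  : hypothetical weight on the redundancy term
    """
    x = np.asarray(x, dtype=float)
    return float(scores @ x - alpha * (x @ sim @ x))

def greedy_select(scores, sim, k, alpha=1.0):
    """Greedy stand-in (NOT the paper's solver): add the best marginal gain k times."""
    scores = np.asarray(scores, dtype=float)
    sim = np.asarray(sim, dtype=float)
    n = len(scores)
    selected = []
    available = np.ones(n, dtype=bool)
    penalty = np.zeros(n)  # similarity of each sample to the already-selected set
    for _ in range(k):
        # Adding sample i changes x^T K x by 2 * sum_{j in S} K_ij (zero diagonal).
        gain = scores - 2.0 * alpha * penalty
        gain[~available] = -np.inf
        i = int(np.argmax(gain))
        selected.append(i)
        available[i] = False
        penalty += sim[:, i]
    return selected

# Toy data: samples 0 and 1 are near-duplicates, sample 2 is distinct but less important.
scores = np.array([1.0, 0.9, 0.5])
sim = np.array([[0.0, 0.8, 0.0],
                [0.8, 0.0, 0.0],
                [0.0, 0.0, 0.0]])
picked = greedy_select(scores, sim, k=2)
```

On this toy input the heuristic keeps sample 0 and then prefers the distinct sample 2 over the near-duplicate sample 1, which is the importance-versus-redundancy trade-off the abstract describes.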
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper presents an elegant and well-formulated approach to data pruning, with a solid theoretical foundation that supports its design.
2. The authors conduct a diverse set of experiments, including vision-language pretraining and LLM fine-tuning, which further strengthens the validation of the method.
3. The performance of InfoMax is impressive, achieving high accuracy in various applications and outperforming existing state-of-the-art methods in many cases.
1. Some notation in the paper is unclear. For example, the symbols \( P \) on line 163 and \( z_n \) on line 1104 lack sufficient explanation. Furthermore, the variable \( X_t \) in lines 1054 to 1067 should be bolded for consistency.
2. The motivation behind InfoMax is not entirely novel, as the concepts of diversity and importance (information) have been previously discussed in [1, 2].
3. The paper does not include comparisons with some relevant baseline methods, such as geometry-based metho
**Structure and Clarity:**
- The work is well organised and clearly presented, defining the problem statement and the hypothesis of the work. The core narrative and all technical contributions are written in a clear and concise manner, guiding most readers to a full understanding of the contributions.
- Most of the key concepts discussed are presented as visualisations or figures, which help justify the narrative and provide an evidential basis for the investigations.
**Method, hypothesis, findin
**Empirical Comparisons**
- How does the method perform when compared to other data pruning methods not included in this work, such as Sieve and Dyn-Unc? If there is a strong reason for not including these works, then please correct me on this point.
- The results in Table 1 could be considered misleading, with incorrect bolding of top results. For 70% on CIFAR-10, D2 performs better, yet InfoMax is highlighted. I assume this is a simple mistake.
- Computational compression between methods is per
- InfoMax can effectively handle datasets with millions of samples within tens of minutes through sparsification techniques and dataset partitioning strategies.
- InfoMax shows better performance compared to existing methods, especially under high pruning ratios.
- InfoMax exhibits strong generalization capabilities across different datasets and tasks, including cross-model and cross-setting generalization.
- The Introduction lacks high-level insight into why InfoMax could work better than \(D^2\) Pruning, which makes it difficult for the reader to grasp the motivation of the proposed pruning algorithm intuitively. For example, what is the more intuitive motivation for maintaining a proper balance between importance and diversity?
- In line 243, \( K_{z,s} \) should be inter-sample redundancy instead of intra-sample redundancy.
- The symbolic sign used in th
Taxonomy
Topics: Data Visualization and Analytics · Data Mining Algorithms and Applications · Advanced Database Systems and Queries
