Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization
Dingshuo Chen, Zhixun Li, Yuyan Ni, Guibin Zhang, Ding Wang, Qiang Liu, Shu Wu, Jeffrey Xu Yu, Liang Wang

TL;DR
This paper introduces MolPeg, a novel source-free data pruning framework for molecular tasks that improves generalization and efficiency by selectively removing less informative data using pretrained models.
Contribution
MolPeg is a new data pruning method that operates without source data, leveraging dual-model loss discrepancy to enhance molecular transfer learning performance.
Findings
Outperforms existing data pruning methods across four molecular tasks.
Can surpass full-dataset training performance with 60-70% data pruning.
Enhances both efficiency and generalization in molecular transfer learning.
Abstract
With the emergence of various molecular tasks and massive datasets, how to perform efficient training has become an urgent yet under-explored issue in the area. Data pruning (DP), as an oft-stated approach to saving training burdens, filters out less influential samples to form a coreset for training. However, the increasing reliance on pretrained models for molecular tasks renders traditional in-domain DP methods incompatible. Therefore, we propose a Molecular data Pruning framework for enhanced Generalization (MolPeg), which focuses on the source-free data pruning scenario, where data pruning is applied with pretrained models. By maintaining two models with different updating paces during training, we introduce a novel scoring function to measure the informativeness of samples based on the loss discrepancy. As a plug-and-play framework, MolPeg realizes the perception of both source…
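The abstract describes the scoring idea only at a high level: two models updated at different paces, with samples ranked by the discrepancy between their losses. The sketch below is one possible reading of that idea, not the authors' implementation. It assumes the slower model is an exponential moving average (EMA) of the faster one, uses a toy linear model with squared loss, and keeps the samples with the largest absolute loss gap. The names `ema_update`, `discrepancy_scores`, and `keep_top` are hypothetical and chosen for illustration.

```python
# Minimal sketch of loss-discrepancy scoring with two models.
# Assumptions (not from the source): EMA reference model, linear model,
# squared loss, and top-k selection by absolute loss gap.

def ema_update(reference, online, momentum=0.9):
    """Move the slow (reference) weights a small step toward the fast (online) weights."""
    return [momentum * r + (1.0 - momentum) * o for r, o in zip(reference, online)]

def squared_loss(weights, x, y):
    """Squared error of a linear model on a single sample."""
    pred = sum(w * xi for w, xi in zip(weights, x))
    return (pred - y) ** 2

def discrepancy_scores(online, reference, samples):
    """Score each (x, y) sample by the absolute loss gap between the two models."""
    return [
        abs(squared_loss(online, x, y) - squared_loss(reference, x, y))
        for x, y in samples
    ]

def keep_top(scores, keep_ratio):
    """Return the sorted indices of the highest-scoring samples to keep."""
    k = max(1, int(len(scores) * keep_ratio))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

online = [1.0, 0.0]      # fast-updating model weights
reference = [0.0, 0.0]   # slow-updating model weights
samples = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0), ([2.0, 0.0], 0.0)]

scores = discrepancy_scores(online, reference, samples)
kept = keep_top(scores, keep_ratio=0.7)
reference = ema_update(reference, online)
```

In this toy setup the second sample receives a score of zero because both models incur the same loss on it, so it is the one pruned. A real training loop would recompute the scores as both models change.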
Taxonomy
Topics: Genetics, Bioinformatics, and Biomedical Research
Methods: Pruning
