Investigating Data Pruning for Pretraining Biological Foundation Models at Scale
Yifan Wu, Jiyue Jiang, Xichen Ye, Yiqi Wang, Chang Zhou, Yitao Xu, Jiayang Chen, He Hu, Weizhong Zhang, Cheng Jin, Jiao Yuan, Yu Li

TL;DR
This paper introduces an influence-guided data pruning method for biological foundation model pretraining, significantly reducing data and computational requirements while maintaining or improving model performance.
Contribution
The paper proposes a post-hoc, influence-guided data pruning framework with two selection strategies, and reports that it is effective and generalizes across biological sequence tasks.
Findings
Even with over 99% of the pretraining data pruned, the selected subsets still outperform random baselines in RNA models.
Coresets outperform larger random subsets in RNA and protein tasks.
Significant redundancy exists in biological sequence datasets.
Abstract
Biological foundation models (BioFMs), pretrained on large-scale biological sequences, have recently shown strong potential in providing meaningful representations for diverse downstream bioinformatics tasks. However, such models often rely on millions to billions of training sequences and billions of parameters, resulting in prohibitive computational costs and significant barriers to reproducibility and accessibility, particularly for academic labs. To address these challenges, we investigate the feasibility of data pruning for BioFM pretraining and propose a post-hoc influence-guided data pruning framework tailored to biological domains. Our approach introduces a subset-based self-influence formulation that enables efficient estimation of sample importance at low computational cost, and builds upon it two simple yet effective selection strategies, namely Top-k Influence (Top I) and…
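As a rough illustration of the Top-k Influence idea named in the abstract, the sketch below keeps the highest-scoring fraction of samples given a precomputed per-sample score, alongside a random baseline of the same size. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `top_k_influence` and `random_subset`, the `keep_fraction` parameter, the toy scores, and the choice to keep (rather than discard) the highest-scoring samples are all hypothetical. How the paper actually computes its subset-based self-influence scores is not shown here.

```python
import numpy as np


def top_k_influence(scores, keep_fraction):
    """Return indices of the highest-scoring samples, in descending score order.

    `scores` is assumed to hold one precomputed influence score per sample;
    how those scores are obtained is outside this sketch.
    """
    scores = np.asarray(scores, dtype=float)
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Keep at least one sample, even for very aggressive pruning ratios.
    k = max(1, int(round(keep_fraction * scores.size)))
    # Sorting the negated scores ascending yields descending order;
    # the stable sort keeps tied samples in their input order.
    order = np.argsort(-scores, kind="stable")
    return order[:k]


def random_subset(n_samples, keep_fraction, seed=0):
    """Random baseline: sample the same number of indices without replacement."""
    rng = np.random.default_rng(seed)
    k = max(1, int(round(keep_fraction * n_samples)))
    return rng.choice(n_samples, size=k, replace=False)


# Toy scores for ten hypothetical sequences (illustrative values only).
scores = np.array([0.1, 2.5, 0.7, 1.9, 0.05, 3.2, 0.4, 1.1, 0.9, 0.2])
kept = top_k_influence(scores, keep_fraction=0.3)
baseline = random_subset(scores.size, keep_fraction=0.3)
print(kept.tolist())  # [5, 1, 3]
```

In a real pipeline the coreset and the random baseline would each be used to pretrain a model, and the two models would then be compared on downstream tasks at matched data budgets.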
Taxonomy
Topics: Machine Learning in Bioinformatics · RNA and protein synthesis mechanisms · Genomics and Chromatin Dynamics
