CoPeP: Benchmarking Continual Pretraining for Protein Language Models
Darshan Patil, Pranshu Malviya, Mathieu Reymond, Quentin Fournier, Sarath Chandar

TL;DR
This paper introduces CoPeP, a benchmark for evaluating continual pretraining methods on protein language models using decade-spanning datasets, showing that temporal information improves model performance and that continual learning methods can outperform naive approaches.
Contribution
We propose the CoPeP benchmark for continual pretraining of protein language models, including datasets, metrics, and evaluation of various continual learning methods.
Findings
Temporal meta-information improves perplexity by up to 7%.
Several continual learning methods outperform naive pretraining.
The benchmark enables large-scale evaluation of continual learning in protein modeling.
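To make the perplexity finding above concrete, the following is a minimal sketch of how a relative perplexity improvement such as "up to 7%" could be computed. It assumes perplexity is the exponential of the mean per-token negative log-likelihood (in nats) and that the improvement is measured relative to the baseline; the summary does not state the paper's exact definition. The function names and the numbers are hypothetical and are not taken from the paper.

```python
import math


def perplexity(token_nlls):
    """Perplexity as exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_improvement(baseline_ppl, new_ppl):
    """Fractional reduction in perplexity relative to the baseline (lower perplexity is better)."""
    return (baseline_ppl - new_ppl) / baseline_ppl


# Hypothetical per-token NLLs for illustration only; not results from the paper.
baseline = perplexity([2.5, 2.4, 2.6])
with_temporal = perplexity([2.45, 2.35, 2.55])
print(f"relative improvement: {relative_improvement(baseline, with_temporal):.1%}")
```

Under this assumed definition, a 7% improvement would correspond to a perplexity of 9.3 against a baseline of 10.0.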
Abstract
Protein language models (pLMs) have recently gained significant attention for their ability to uncover relationships between sequence, structure, and function from evolutionary statistics, thereby accelerating therapeutic drug discovery. These models learn from large protein databases that are continuously updated by the biology community and whose dynamic nature motivates the application of continual learning, not only to keep up with the ever-growing data, but also as an opportunity to take advantage of the temporal meta-information that is created during this process. As a result, we introduce the Continual Pretraining of Protein Language Models (CoPeP) benchmark, a novel benchmark for evaluating continual learning approaches on pLMs. Specifically, we curate a sequence of protein datasets derived from the UniProt Knowledgebase spanning a decade and define metrics to assess pLM…
Peer Reviews
Decision: Submitted to ICLR 2026
1. Overall, the paper is well written, with figures serving as visual illustrations. The Introduction clearly explains the motivation behind the benchmark, compares it to existing work, and identifies the drawbacks of that work. 2. Proposing a benchmark to capture the temporal and sequential nature of proteins is novel and significant to me. It is important to have such a benchmark to advance the research community and open future research into the evolutionary process of proteins. 3. Experiment
1. Usually when we do experiments, we encourage authors to repeat the same experimental setting multiple times and report both the mean and the standard deviation. However, this paper reports the mean but not the standard deviation, which makes it difficult for readers to judge how significantly the proposed method outperforms the baselines. 2. The authors are encouraged to provide more evaluation metrics to test the proposed benchmark comprehensively, such as property prediction for newly discovered proteins given previously known protei
The paper introduces a large-scale, realistic benchmark for continual pretraining in protein language models, moving beyond small synthetic datasets commonly used in continual learning research. It evaluates a diverse set of methods, demonstrates the utility of temporal metadata, and highlights performance gains of continual learning approaches over naive baselines. The benchmark is extensible as new UniProt releases become available, making it a good long-term resource for protein modeling comm
- Baseline fairness and clarity: The paper does not clearly explain how baseline numbers (AMPLIFY-1M, single-year training) were obtained or whether training budgets, data exposure, and deduplication policies were matched across methods. In particular, it is unclear on which dataset AMPLIFY-1M was trained. - Potential underoptimization: The continual training setup shows performance gains simply from sequential exposure to data. This raises the concern that the 2015 baseline model used for comparis
- Addresses a timely and relevant topic: continual adaptation of pLMs as curated biological databases evolve over time. - Uses real yearly UniRef100 snapshots rather than synthetic domain shifts. - Temporal metadata exploitation (sequence persistence) is a novel and biologically meaningful idea. - Includes multiple continual learning strategies (replay, plasticity-preserving, unlearning) with a consistent experimental protocol. - Transparent about engineering details.
1. The benchmark problem definition is incomplete and too narrowly evaluated. Evaluating a protein LM benchmark only on perplexity and a single mutational-fitness metric (ProteinGym) is far from sufficient. Real downstream utility of pLMs spans structure prediction, binding, stability, low-shot generalization, sequence design, etc. A benchmark should reflect that diversity, not just perplexity-like proxies. 2. The proposed notion of “continual learning” is too narrowly framed as purely temporal
Taxonomy
Topics: Machine Learning in Bioinformatics · Domain Adaptation and Few-Shot Learning · Computational Drug Discovery Methods
