Scaling and Data Saturation in Protein Language Models
Aviv Spinner, Erika DeBenedictis, Corey M. Hudson

TL;DR
This study investigates how the amount of pretraining data affects the performance of protein language models. Performance improves with added data, though not monotonically, with no clear sign of saturation, which highlights the importance of targeted data collection.
Contribution
It analyzes data scaling in protein language models pretrained on yearly snapshots of UniRef100 from 2011 to 2024, and argues for targeted data acquisition strategies.
Findings
Performance improves with more data, but not monotonically.
On a well-characterized E. coli beta-lactamase, unsupervised model predictions improve year over year but do not yet consistently outperform the supervised baseline.
No evidence of model performance saturation on protein function prediction.
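The findings above distinguish between an overall upward trend, strict monotonic improvement, and saturation. As a minimal sketch of how one might check these three properties given one score per yearly snapshot, the Python below uses only the standard library. The function names, the `window` parameter, and the toy values are assumptions made for illustration; they are not the authors' analysis code and not results from the paper.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    rx, ry = rank(xs), rank(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def is_monotonic(scores):
    """True if the score never drops from one snapshot to the next."""
    return all(b >= a for a, b in zip(scores, scores[1:]))


def summarize_trend(years, scores, window=3):
    """Summarize trend, monotonicity, and a crude saturation indicator.

    recent_gain is the mean of the last `window` scores minus the mean of
    the `window` scores before them; a value near zero would hint at a
    plateau. This is a hypothetical heuristic, not a statistical test.
    """
    if len(scores) < 2 * window:
        raise ValueError("need at least 2 * window scores")
    return {
        "monotonic": is_monotonic(scores),
        "spearman": spearman(years, scores),
        "recent_gain": mean(scores[-window:]) - mean(scores[-2 * window:-window]),
    }


# Made-up placeholder values for illustration only, NOT results from the paper.
toy_years = [2011, 2012, 2013, 2014, 2015, 2016]
toy_scores = [0.30, 0.34, 0.32, 0.38, 0.41, 0.45]
summary = summarize_trend(toy_years, toy_scores)
```

On the toy series, the dip from the second to the third snapshot makes `monotonic` false even though the rank correlation with year stays strongly positive and `recent_gain` is above zero, which is the pattern the findings describe: improvement without monotonicity and without an apparent plateau.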
Abstract
Data in biology is redundant, noisy, and sparse. How does the type and scale of available data impact model performance? In this work, we specifically investigate how protein language models (pLMs) scale with increasing pretraining data. We investigate this relationship by measuring the performance of protein function prediction on a suite of pLMs pretrained on yearly snapshots of UniRef100 from 2011 to 2024. We find no evidence of model saturation on this task: performance improves--but not monotonically--with added data, and this trend differs between unsupervised and supervised experiments. Using a well-characterized Beta-Lactamase protein from E. coli, we find that unsupervised model predictions get better year-over-year, though they do not yet consistently perform better than the supervised baseline. Our results underscore the need for targeted data acquisition and deeper study of…
