Continued domain-specific pre-training of protein language models for pMHC-I binding prediction
Sergio E. Mares, Ariel Espinoza Weinberger, Nilah M. Ioannidis

TL;DR
This study tests whether domain-specific continued pre-training of protein language models improves pMHC-I binding affinity prediction, especially for underrepresented alleles. The models are further pre-trained with masked-language modeling (MLM) on HLA-associated peptides and then fine-tuned on high-quality quantitative binding data.
Contribution
It demonstrates that continued pre-training of protein language models on HLA peptides enhances pMHC-I binding prediction accuracy, addressing data scarcity and allelic diversity challenges.
Findings
Improved prediction accuracy for underrepresented alleles.
Effective use of MLM-based continued pre-training on HLA peptides.
Avoidance of biases from mass spectrometry data.
Abstract
Predicting peptide–major histocompatibility complex I (pMHC-I) binding affinity remains challenging due to extreme allelic diversity (30,000 HLA alleles), severe data scarcity for most alleles, and noisy experimental measurements. Current methods particularly struggle with underrepresented alleles and quantitative binding prediction. We test whether domain-specific continued pre-training of protein language models is beneficial for their application to pMHC-I binding affinity prediction. Starting from ESM Cambrian (300M parameters), we perform masked-language modeling (MLM)-based continued pre-training on HLA-associated peptides (epitopes), testing two input formats: epitope sequences alone versus epitopes concatenated with HLA heavy chain sequences. We then fine-tune for functional IC50 binding affinity prediction using only high-quality quantitative data, avoiding mass…
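The two MLM input formats described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the `<mask>` and `<sep>` tokens, the 15% mask rate, the epitope-first ordering, and the helper names `build_input` and `mask_tokens` are all assumptions made for illustration, and a real setup would use the ESM Cambrian tokenizer rather than single-character tokens.

```python
import random

MASK = "<mask>"  # assumed mask token; the actual tokenizer may differ
SEP = "<sep>"    # hypothetical separator; the source does not specify one


def build_input(epitope, hla_heavy_chain=None):
    """Build a token list for one of the two input formats.

    epitope alone          -> residues of the epitope
    epitope + heavy chain  -> epitope residues, SEP, heavy-chain residues
    (the epitope-first ordering is an assumption)
    """
    tokens = list(epitope)
    if hla_heavy_chain is not None:
        tokens = tokens + [SEP] + list(hla_heavy_chain)
    return tokens


def mask_tokens(tokens, mask_rate=0.15, rng=None):
    """Replace a random subset of residues with MASK.

    Returns (masked_tokens, labels), where labels holds the original
    residue at masked positions and None elsewhere. The separator is
    never masked. At least one position is always masked.
    """
    rng = rng or random.Random()
    candidates = [i for i, tok in enumerate(tokens) if tok != SEP]
    n_mask = max(1, round(mask_rate * len(candidates)))
    chosen = set(rng.sample(candidates, n_mask))
    masked = [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in chosen else None for i, tok in enumerate(tokens)]
    return masked, labels


if __name__ == "__main__":
    epitope = "GILGFVFTL"       # influenza M1 peptide presented by HLA-A*02:01
    heavy_chain = "GSHSMRYF"    # short illustrative fragment, not a full sequence
    for tokens in (build_input(epitope), build_input(epitope, heavy_chain)):
        masked, _ = mask_tokens(tokens, rng=random.Random(0))
        print(" ".join(masked))
```

During continued pre-training the model would be trained to recover the residues stored in `labels` from the masked input; the only difference between the two formats is whether the HLA heavy chain is available as context.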
Taxonomy
Topics: Machine Learning in Bioinformatics · Chemical Synthesis and Analysis · RNA and protein synthesis mechanisms
