# Masked language modeling pretraining dynamics for downstream peptide: T-cell receptor binding prediction

**Authors:** Brock Landry, Jian Zhang

PMC · DOI: 10.1093/bioadv/vbaf028 · Bioinformatics Advances · 2025-02-20

## TL;DR

This paper studies how pretraining with masked language modeling affects the ability to predict peptide:TCR binding, showing that performance peaks before pretraining loss converges.

## Contribution

The paper shows that pretraining loss does not reliably identify the best-performing downstream checkpoint, and it identifies a loss threshold beyond which further pretraining yields no additional downstream benefit.

## Key findings

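- Downstream performance peaks substantially before pretraining loss converges.
- Pretraining loss is largely ineffective for pinpointing the best downstream checkpoint, but it can mark the threshold at which the downstream benefits of pretraining have fully diminished.
- Pretraining beyond this threshold does not harm downstream performance, but it produces unpredictable deviations in both directions around the post-threshold average.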

## Abstract

Predicting antigen peptide and T-cell receptor (TCR) binding is difficult due to the combinatoric nature of peptides and the scarcity of labeled peptide-binding pairs. The masked language modeling method of pretraining is reliably used to increase the downstream performance of peptide:TCR binding prediction models by leveraging unlabeled data. In the literature, binding prediction models are commonly trained until the validation loss converges. To evaluate this method, cited transformer model architectures pretrained with masked language modeling are investigated to assess the benefits of achieving lower loss metrics during pretraining. The downstream performance metrics for these works are recorded after each subsequent interval of masked language modeling pretraining.
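
The evaluation protocol described above (recording a downstream metric after each pretraining interval) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `CheckpointRecord`, `evaluate_checkpoints`, and the two callables `pretrain_loss` and `finetune_and_score` are hypothetical names, and how each checkpoint is fine-tuned and scored is left to the caller.

```python
from typing import Callable, List, NamedTuple, Sequence


class CheckpointRecord(NamedTuple):
    interval: int            # index of the pretraining interval
    pretrain_loss: float     # masked language modeling loss at this checkpoint
    downstream_score: float  # binding-prediction metric after fine-tuning


def evaluate_checkpoints(
    checkpoints: Sequence[str],
    pretrain_loss: Callable[[str], float],
    finetune_and_score: Callable[[str], float],
) -> List[CheckpointRecord]:
    """Pair each saved checkpoint's pretraining loss with its downstream score.

    Both callables are assumptions supplied by the caller: one returns the
    pretraining loss for a checkpoint, the other fine-tunes that checkpoint
    on the binding task and returns the resulting metric.
    """
    records = []
    for interval, ckpt in enumerate(checkpoints):
        records.append(
            CheckpointRecord(interval, pretrain_loss(ckpt), finetune_and_score(ckpt))
        )
    return records
```

Keeping the loss and the downstream score in the same record makes it straightforward to compare the two curves interval by interval.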

The results demonstrate that the downstream performance benefit achieved from masked language modeling peaks substantially before the pretraining loss converges. Using the pretraining loss metric is largely ineffective for precisely identifying the best downstream performing pretrained model checkpoints (or saved states). However, the pretraining loss metric in these scenarios can be used to mark a threshold in which the downstream performance benefits from pretraining have fully diminished. Further pretraining beyond this threshold does not negatively impact downstream performance but results in unpredictable bilateral deviations from the post-threshold average downstream performance benefit.
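
To make the distinction between the best checkpoint and the plateau threshold concrete, here is a small sketch under stated assumptions. The windowed-gain heuristic in `plateau_start`, its `window` and `tolerance` parameters, and the numbers below are all illustrative inventions; they are not the authors' procedure or data.

```python
from statistics import mean
from typing import Optional, Sequence


def best_checkpoint(scores: Sequence[float]) -> int:
    """Index of the checkpoint with the highest downstream score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def plateau_start(
    scores: Sequence[float], window: int = 2, tolerance: float = 0.02
) -> Optional[int]:
    """First index where the windowed mean stops improving by `tolerance`.

    Compares the mean of the next `window` scores with the mean of the
    previous `window` scores. This heuristic is an assumption chosen for
    illustration only.
    """
    for i in range(window, len(scores) - window + 1):
        gain = mean(scores[i:i + window]) - mean(scores[i - window:i])
        if gain < tolerance:
            return i
    return None


# Synthetic, made-up values for illustration only (not from the paper).
pretrain_losses = [2.5, 2.0, 1.6, 1.3, 1.1, 1.0, 0.95, 0.93]
downstream_scores = [0.60, 0.70, 0.78, 0.80, 0.79, 0.81, 0.80, 0.79]

threshold_idx = plateau_start(downstream_scores)
threshold_loss = None if threshold_idx is None else pretrain_losses[threshold_idx]
best_idx = best_checkpoint(downstream_scores)
```

In this toy series the pretraining loss keeps falling after the plateau begins, while the downstream score only fluctuates around its post-threshold average. The single best score lands at an arbitrary point inside that plateau, which is why the loss can mark the threshold but not the best checkpoint.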

The datasets used in this article for model training are publicly available from each original model’s authors at https://github.com/SFGLab/bertrand, https://github.com/wukevin/tcr-bert, https://github.com/NKI-AI/STAPLER, and https://github.com/barthelemymp/TULIP-TCR.

## Full-text entities

- **Genes:** TRBV20OR9-2 (T cell receptor beta variable 20/OR9-2 (non-functional)) [NCBI Gene 6962] {aka CDR3, TCRBV20S2, TCRBV2O, TCRBV2S2O}

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11908642/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11908642/full.md

## References

23 references — full list in the complete paper: https://tomesphere.com/paper/PMC11908642/full.md

---
Source: https://tomesphere.com/paper/PMC11908642