Retrieval and competition: how a protein foundation model starts a protein
Piotr Jedryszek, Oliver M. Crook

TL;DR
This paper investigates how a protein language model (ESM2-8M) predicts the first amino acid of a protein. It finds that the prediction comes from retrieval of a statistical default rather than from recognition of biological evidence at the masked position.
Contribution
The paper traces the retrieval-based mechanism behind start-of-protein predictions in ESM2-8M and introduces a norm-direction decomposition of attention scores within rotary frequency bands to analyse how positional information reaches the readout.
Findings
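The model does not detect methionine at the masked position; it retrieves a methionine-favouring signal from a reference representation at the beginning-of-sequence token, using a position-specific query assembled across layers.
The final prediction emerges from competition between this positional-prior retrieval circuit and context-dependent circuits.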
The model predicts methionine even in biologically divergent cases where that prediction is incorrect.
Abstract
Protein language models are increasingly used to guide experimental and clinical decisions, yet it is often unclear whether a confident prediction reflects recognition of biological evidence or retrieval of a statistical default. We examine this distinction for a near-universal biological rule, that proteins begin with methionine, by tracing the computational pathway through which ESM2-8M produces this prediction. The model does not detect methionine at the masked position. Instead, it retrieves a methionine-favouring signal from a reference representation at the beginning-of-sequence token via a position-specific query assembled across layers, with the final output emerging through competition with context-dependent circuits. To understand how positional information reaches the readout, we introduce a norm-direction decomposition of attention scores within rotary frequency bands.…
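The abstract does not spell out the norm-direction decomposition, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a single attention head whose query and key are rotated in two-dimensional bands, one rotary frequency per band, and splits each band's contribution to the attention score into a norm term (product of the band magnitudes) and a direction term (cosine of the angle between the rotated sub-vectors). The function names, the interleaved pairing of dimensions, and the frequency base of 10000 are all illustrative assumptions; real implementations may pair dimensions differently.

```python
import numpy as np


def rotate_pairs(x, pos, base=10000.0):
    """Rotate x (shape [d], d even) in 2-D bands, one rotary frequency per band.

    Assumption: dimensions are paired as (2i, 2i+1). Returns an array of
    shape [d // 2, 2], one row per frequency band.
    """
    d = x.shape[0]
    freqs = base ** (-np.arange(0, d, 2) / d)  # hypothetical frequency schedule
    angles = pos * freqs
    pairs = x.reshape(-1, 2)
    cos, sin = np.cos(angles), np.sin(angles)
    return np.stack(
        [pairs[:, 0] * cos - pairs[:, 1] * sin,
         pairs[:, 0] * sin + pairs[:, 1] * cos],
        axis=1,
    )


def norm_direction_decomposition(q, k, q_pos, k_pos):
    """Split each band's score contribution into a norm term and a direction term.

    Per band b: score_b = (|q_b| * |k_b|) * cos(angle between rotated q_b and k_b).
    """
    q_bands = rotate_pairs(q, q_pos)
    k_bands = rotate_pairs(k, k_pos)
    norm_term = np.linalg.norm(q_bands, axis=1) * np.linalg.norm(k_bands, axis=1)
    dots = (q_bands * k_bands).sum(axis=1)
    # Direction is undefined for a zero-norm band; report 0 there.
    direction_term = np.divide(
        dots, norm_term, out=np.zeros_like(dots), where=norm_term > 0
    )
    return norm_term, direction_term


rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)
norm_term, direction_term = norm_direction_decomposition(q, k, q_pos=1, k_pos=0)
score = float((norm_term * direction_term).sum())  # unscaled attention logit
```

Under these assumptions the norm term is independent of position, because rotation preserves length, so any positional dependence of the score sits entirely in the direction term of each band. That separation is what makes a band-wise view useful for asking how positional information reaches a readout.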
