Transformers trained on proteins can learn to attend to Euclidean distance

Isaac Ellmen; Constantin Schneider; Matthew I.J. Raybould; Charlotte M. Deane

arXiv:2502.01533·cs.LG·August 4, 2025

TL;DR

This paper demonstrates that standard Transformer models trained on protein sequences can independently learn to attend to Euclidean distances in 3D space, enhancing protein structure modeling without specialized structural architectures.

Contribution

It provides a theoretical framework and empirical validation that Transformers can learn structural information directly from coordinate embeddings, enabling hybrid structure-language modeling.

Findings

01. Transformers can learn to filter attention as a 3D Gaussian with learned variance.

02. Pre-training on structural data improves downstream protein modeling tasks.

03. Transformers outperform custom structural models when trained with structural information.

Abstract

While conventional Transformers generally operate on sequence data, they can be used in conjunction with structure models, typically SE(3)-invariant or equivariant graph neural networks (GNNs), for 3D applications such as protein structure modelling. These hybrids typically involve either (1) preprocessing/tokenizing structural features as input for Transformers or (2) taking Transformer embeddings and processing them within a structural representation. However, there is evidence that Transformers can learn to process structural information on their own, such as the AlphaFold3 structural diffusion model. In this work we show that Transformers can function independently as structure models when passed linear embeddings of coordinates. We first provide a theoretical explanation for how Transformers can learn to filter attention as a 3D Gaussian with learned variance. We then validate this…
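
The claim that attention can act as a 3D Gaussian filter rests on a standard identity: the squared distance expands as ||x_i - x_j||^2 = ||x_i||^2 - 2 x_i·x_j + ||x_j||^2, and softmax ignores any term that is constant along a row. The sketch below checks this numerically in plain Python. It is not the paper's construction: the function names and the fixed `sigma` are hypothetical, the query and key features are set by hand and not learned, and the squared norm is supplied as an extra key feature, whereas the paper works from linear embeddings of the coordinates.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gaussian_weights(coords, sigma):
    """Attention weights from the Gaussian logit -||x_i - x_j||^2 / (2 sigma^2)."""
    rows = []
    for xi in coords:
        logits = [
            -sum((a - b) ** 2 for a, b in zip(xi, xj)) / (2 * sigma**2)
            for xj in coords
        ]
        rows.append(softmax(logits))
    return rows

def dot_product_weights(coords, sigma):
    """The same weights from an ordinary q.k softmax.

    Assumption of this sketch: the key carries -||x_j||^2 / (2 sigma^2)
    as a fourth feature, and the query carries a constant 1 to pick it up.
    The -||x_i||^2 term is constant per row, so softmax cancels it.
    """
    queries = [[c / sigma**2 for c in x] + [1.0] for x in coords]
    keys = [list(x) + [-dot(x, x) / (2 * sigma**2)] for x in coords]
    return [softmax([dot(q, k) for k in keys]) for q in queries]

# Hypothetical toy coordinates (arbitrary units).
coords = [(0.0, 0.0, 0.0), (1.0, 0.5, -0.2), (3.0, -1.0, 2.0), (0.2, 0.1, 0.0)]
direct = gaussian_weights(coords, sigma=1.5)
via_dot = dot_product_weights(coords, sigma=1.5)

# Largest elementwise difference between the two weight matrices.
max_gap = max(
    abs(a - b) for ra, rb in zip(direct, via_dot) for a, b in zip(ra, rb)
)
print(max_gap)
```

In this toy version `sigma` plays the role of the learned variance: a smaller value concentrates each row of weights on spatially nearby positions.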

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 8 · Confidence 3

Strengths

This paper seems to bring an original contribution by demonstrating that Transformers can handle 3D structural reasoning on their own. This capability challenges the reliance on dedicated modules for spatial tasks that enforce symmetries and equivariance, and it introduces a new approach to protein structure modeling and other applications involving 3D data. The authors provide a sound theoretical explanation of how Transformers can approximate Gaussian distance filters making use of coordinates

Weaknesses

The quality of some of the plots could be improved. For example, the plots in Fig. 4a-d are hard to read (the tick labels are essentially invisible), and a different strategy for conveying the relevant information could be considered.

Reviewer 02 · Rating 5 · Confidence 3

Strengths

Protein language models that operate on amino acid sequences are of significant interest for generative design, among other downstream tasks. Likewise, protein structure prediction, as in the recent Nobel Prize-winning work of AlphaFold*, is of central importance in biochemistry. This work aims to demonstrate that protein language models themselves have some structural biology capabilities, which also helps with downstream tasks such as protein prediction; this provides significance. The paper

Weaknesses

A swath of past work on the interpretability of protein language models, going back to Vig et al., "BERTology meets biology" (ICLR 2021), also shows that attention mechanisms capture the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence but spatially close in the three-dimensional structure (among other results). See also the references therein. It is not clear how much more extra novelty this paper provides beyond this existing line of literature,

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The main idea of the paper is, indeed, interesting. It turns out that, to some degree, transformers without explicit SE(3) invariance can learn SE(3) invariant functions.
2. Experiments which confirm (1) are provided.

Weaknesses

1. To what degree do Equations 2 and 4 hold? That is, what is the order of the omitted terms?
2. The notation is ambiguous. Sometimes "x" is a vector, sometimes it isn't. For example, A1, A3.
3. I don't understand the purpose of Section 3.2.3, "PROTEIN FUNCTION PREDICTION". Using embeddings from pre-trained networks for protein property prediction is an established practice.
4. A very natural experiment is missing. You can take a pre-trained transformer which is presumably SE(3) invariant and shift/rotate
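
The review text above is cut off, so the exact protocol the reviewer had in mind is not known. As a rough illustration of a shift/rotate check in general, the sketch below applies one rigid motion to toy coordinates and measures how much two hand-written attention maps change. Everything here is hypothetical and stands in for a real model: `distance_attention` depends only on pairwise distances, while `raw_dot_attention` uses unprocessed coordinate dot products.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def distance_attention(coords, sigma=1.5):
    """Toy attention whose logits depend only on pairwise squared distance."""
    return [
        softmax([
            -sum((a - b) ** 2 for a, b in zip(xi, xj)) / (2 * sigma**2)
            for xj in coords
        ])
        for xi in coords
    ]

def raw_dot_attention(coords):
    """Toy attention whose logits are raw coordinate dot products."""
    return [
        softmax([sum(a * b for a, b in zip(xi, xj)) for xj in coords])
        for xi in coords
    ]

def rigid_transform(coords, angle, shift):
    """Rotate about the z-axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]

def max_change(attention_fn, coords, moved):
    """Largest elementwise change in attention weights after the motion."""
    before, after = attention_fn(coords), attention_fn(moved)
    return max(
        abs(a - b) for ra, rb in zip(before, after) for a, b in zip(ra, rb)
    )

# Hypothetical toy coordinates and an arbitrary rigid motion.
coords = [(0.0, 0.0, 0.0), (1.0, 0.5, -0.2), (3.0, -1.0, 2.0), (0.2, 0.1, 0.0)]
moved = rigid_transform(coords, angle=0.7, shift=(2.0, -1.0, 0.5))

gap_distance = max_change(distance_attention, coords, moved)
gap_raw = max_change(raw_dot_attention, coords, moved)
print(gap_distance, gap_raw)
```

For a trained model, the same comparison would be run on the model's outputs or attention maps instead of these toy functions. A change near floating-point error suggests the learned function is approximately invariant, while a large change suggests it is not.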

Code & Models

Repositories

Ellmen/attending-to-distance
PyTorch · Official

Taxonomy

Topics: Machine Learning in Bioinformatics · Genetics, Bioinformatics, and Biomedical Research

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Residual Connection · Multi-Head Attention · Diffusion · Position-Wise Feed-Forward Layer · Adam · Softmax · Dropout