Structure-Aligned Protein Language Model
Can Chen, David Heurtel-Depeiges, Robert M. Vernon, Christopher James Langmead, Yoshua Bengio, Quentin Fournier

TL;DR
This paper introduces a dual-task framework that enriches protein language models (pLMs) with structural knowledge through contrastive alignment with pre-trained protein graph neural networks and structure token prediction, improving downstream tasks such as contact prediction and deep mutational scanning fitness prediction.
Contribution
The authors propose a novel method combining contrastive learning and structure token prediction to incorporate structural information into protein language models, enhancing their performance.
Findings
Improved performance in deep mutational scanning fitness prediction.
59% increase in contact prediction precision at L on CASP16.
Performance gains scale with model size and are consistent across tasks.
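For readers unfamiliar with the contact prediction metric named above, precision at L scores the L highest-ranked residue pairs of a protein of length L and reports the fraction that are true contacts. The following is a minimal sketch of that definition, not the paper's evaluation code: the function name `precision_at_l` and the default minimum sequence separation `min_sep=6` are assumptions for illustration, and the exact separation range used on CASP16 is not stated in this summary.

```python
import numpy as np

def precision_at_l(scores, contacts, min_sep=6):
    """Fraction of true contacts among the top-L scored residue pairs.

    scores:   (L, L) matrix of predicted contact scores.
    contacts: (L, L) boolean matrix of true contacts.
    min_sep:  minimum sequence separation j - i (assumed default).
    """
    scores = np.asarray(scores, dtype=float)
    contacts = np.asarray(contacts, dtype=bool)
    length = scores.shape[0]
    # Visit each unordered pair once, keeping only pairs with j - i >= min_sep.
    i_idx, j_idx = np.triu_indices(length, k=min_sep)
    if i_idx.size == 0:
        return 0.0
    top_k = min(length, i_idx.size)
    # Positions of the top_k highest-scoring candidate pairs.
    order = np.argsort(-scores[i_idx, j_idx], kind="stable")[:top_k]
    return float(contacts[i_idx[order], j_idx[order]].mean())
```

Restricting the ranking to pairs above a separation threshold keeps trivially adjacent residues from inflating the score; reporting the metric separately for short, medium, and long-range pairs is a common variant.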
Abstract
Protein language models (pLMs) pre-trained on vast protein sequence databases excel at various downstream tasks but often lack the structural knowledge essential for some biological applications. To address this, we introduce a method to enrich pLMs with structural knowledge by leveraging pre-trained protein graph neural networks (pGNNs). First, a latent-level contrastive learning task aligns residue representations from pLMs with those from pGNNs across multiple proteins, injecting inter-protein structural information. Additionally, a physical-level task integrates intra-protein information by training pLMs to predict structure tokens. Together, the proposed dual-task framework effectively incorporates both inter- and intra-protein structural knowledge into pLMs. Given the variability in the quality of protein structures in PDB, we further introduce a residue loss selection module that…
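The two objectives described in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it uses a generic symmetric InfoNCE loss for the latent-level alignment and a plain cross-entropy for structure token prediction, assumes pLM and pGNN residue embeddings have already been projected to a shared dimension, and treats the residues pooled from a batch of proteins as one set of positives and negatives. All function names, the `temperature`, and the `weight` are hypothetical. The residue loss selection module is omitted because its details are truncated in this summary.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def contrastive_alignment_loss(plm_emb, gnn_emb, temperature=0.1):
    """Symmetric InfoNCE over N residues pooled from several proteins.

    Row r of plm_emb and row r of gnn_emb describe the same residue
    (the positive pair); every other row acts as a negative.
    """
    a = l2_normalize(np.asarray(plm_emb, dtype=float))
    b = l2_normalize(np.asarray(gnn_emb, dtype=float))
    logits = a @ b.T / temperature          # (N, N) cosine similarities
    diag = np.arange(logits.shape[0])
    plm_to_gnn = -log_softmax(logits, axis=1)[diag, diag].mean()
    gnn_to_plm = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (plm_to_gnn + gnn_to_plm)

def structure_token_loss(token_logits, token_ids):
    """Cross-entropy of per-residue logits against structure token ids."""
    token_ids = np.asarray(token_ids, dtype=int)
    logp = log_softmax(np.asarray(token_logits, dtype=float), axis=-1)
    return -logp[np.arange(token_ids.shape[0]), token_ids].mean()

def dual_task_loss(plm_emb, gnn_emb, token_logits, token_ids, weight=1.0):
    # Hypothetical combination: a weighted sum of the two objectives.
    return (contrastive_alignment_loss(plm_emb, gnn_emb)
            + weight * structure_token_loss(token_logits, token_ids))
```

Because negatives are drawn from residues across the whole batch, the alignment term sees residues from other proteins, which matches the abstract's description of it as carrying inter-protein information, while the token term is computed residue by residue within each protein.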
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper is well written and the methodology is clear.
- The authors tackle an important problem of creating better protein sequence representations that account for the multimodal nature of proteins.
- The specific method of computing the loss seems novel.
There are several methods that combine sequence and structure, including with contrastive learning, e.g.:
[1] CCPL: Cross Modal Contrastive Protein Learning https://arxiv.org/abs/2303.11783
[2] S-PLM: Structure-aware Protein Language Model via Contrastive Learning between Sequence and Structure https://pubmed.ncbi.nlm.nih.gov/37609352/
[3] BioCLIP - Contrasting Sequence with Structure: Pre-training Graph Representations with PLMs https://www.biorxiv.org/content/10.1101/2023.12.01.569611v1
[4]
* Model architecture is well-motivated, the resolution-based loss is novel and ablations are thorough.
* Improved PLMs have very high application potential.
* There is a related work which could bear including in the baselines [1].
* The authors should explore a direct feature fusion approach. It is not clear what is left once even the dual loss is removed, and this is probably not the strongest form of this ablation. [2]
1. Chen, Dexiong, et al. "Endowing protein language models with structural knowledge." arXiv preprint arXiv:2401.14819 (2024).
2. Dai, Yimian, et al. "Attentional feature fusion." Proceedings of the IEEE/CVF winter conference on applica
Most claims are well-supported, particularly the performance on the many downstream tasks, which is consistent, albeit marginal. Overall the paper is easy to understand: the method is presented in a straightforward way,
Because the gains observed in the downstream tasks are marginal, subsamples could be used to obtain variances of the metric estimates to show that those increases are still substantial, and ideally statistically significant. Additionally, the claim that the embeddings show better separation as a result of structural alignment, which is based on UMAP projections, needs to be supplemented with more quantitative methods measuring separability. Also, Figure 2 needs to have some kind of statistical test to show the
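One way to act on the reviewer's suggestion about metric variance is a paired bootstrap over test examples, sketched here. This is a generic resampling technique and not something the paper is reported to have run; the function name `paired_bootstrap_ci`, the number of resamples, and the 95% level are assumptions for illustration.

```python
import random

def paired_bootstrap_ci(baseline, improved, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-example gain.

    baseline, improved: per-example scores for the two models, in the
    same example order. Returns (mean_gain, lower, upper).
    """
    if len(baseline) != len(improved) or len(baseline) == 0:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(baseline, improved)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample examples with replacement, keeping each pair together.
        sample = rng.choices(diffs, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper
```

If the resulting interval excludes zero, the gain is unlikely to be an artifact of which test examples were sampled; an interval that straddles zero would support the reviewer's concern that the improvement is marginal.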
Taxonomy
Topics: Machine Learning in Bioinformatics · Bioinformatics and Genomic Networks · Genomics and Rare Diseases
Methods: Contrastive Learning
