# Trainable subnetworks reveal insights into structure knowledge organization in protein language models

**Authors:** Ria Vinod, Ava P. Amini, Lorin Crawford, Kevin K. Yang, Nir Ben-Tal, Rachel Kolodny

PMC · DOI: 10.1371/journal.pcbi.1013925 · PLOS Computational Biology · 2026-02-09

## TL;DR

This paper introduces a method to study how protein structure information is encoded in language models by isolating subnetworks tied to specific structural categories.

## Contribution

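The paper introduces trainable subnetworks, which mask out the PLM weights responsible for language modeling performance on a given structural category of proteins. The authors train 39 such subnetworks and use them to examine how structural knowledge is distributed across PLM weights and how that distribution affects downstream structure prediction.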
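
One way to picture a trainable subnetwork is as a learned binary mask applied to frozen pretrained weights. The sketch below is only an illustration under assumptions: this summary does not state how the masks are parameterized or optimized, so the sigmoid-over-logits parameterization, the 0.5 threshold, and the names `apply_mask` and `fraction_masked` are hypothetical, and the toy weights are placeholders rather than real PLM parameters.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def apply_mask(weights, mask_logits, hard=True, threshold=0.5):
    """Multiply frozen weights elementwise by a mask derived from logits.

    Assumption: the logits are the only trainable values; the pretrained
    weights themselves are never updated.
    """
    masked = []
    for w, s in zip(weights, mask_logits):
        keep = sigmoid(s)  # soft keep-value in (0, 1)
        if hard:  # binarize the mask
            keep = 1.0 if keep >= threshold else 0.0
        masked.append(w * keep)
    return masked


def fraction_masked(mask_logits, threshold=0.5):
    """Share of weights that the hard mask sets to zero."""
    dropped = sum(1 for s in mask_logits if sigmoid(s) < threshold)
    return dropped / len(mask_logits)


# Toy example: four placeholder weights and hypothetical learned logits.
weights = [0.8, -1.2, 0.3, 2.0]
mask_logits = [4.0, -3.0, 2.5, -0.1]

subnetwork = apply_mask(weights, mask_logits)
masked_share = fraction_masked(mask_logits)
```

In this toy case the second and fourth weights are zeroed, so half of the weights are masked. A soft mask (`hard=False`) keeps the operation differentiable with respect to the logits, which is one common reason to parameterize masks this way.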

## Key findings

- PLMs are highly sensitive to sequence-level features and predominantly disentangle either extremely coarse or fine-grained structural information.
- Structure prediction accuracy is highly responsive to factorized PLM representations.
- Small changes in language modeling performance can significantly impair PLM-based structure prediction.

## Abstract

Protein language models (PLMs) pretrained via a masked language modeling objective have proven effective across a range of structure-related tasks, including high-resolution structure prediction. However, it remains unclear to what extent these models factorize protein structural categories among their learned parameters. In this work, we introduce trainable subnetworks, which mask out the PLM weights responsible for language modeling performance on a structural category of proteins. We systematically trained 39 PLM subnetworks targeting both sequence- and residue-level features at varying degrees of resolution using annotations defined by the CATH taxonomy and secondary structure elements. Using these PLM subnetworks, we assessed how structural factorization in PLMs influences downstream structure prediction. Our results show that PLMs are highly sensitive to sequence-level features and can predominantly disentangle extremely coarse or fine-grained information. Furthermore, we observe that structure prediction is highly responsive to factorized PLM representations and that small changes in language modeling performance can significantly impair PLM-based structure prediction capabilities. Our work presents a framework for studying feature entanglement within pretrained PLMs and can be leveraged to improve the alignment of learned PLM representations with known biological concepts.
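
To make "varying degrees of resolution" concrete, the following sketch splits a set of sequences into a target category and everything else at a chosen level of the CATH hierarchy (Class, Architecture, Topology, Homologous superfamily). It is an assumption-laden illustration, not the authors' data pipeline: the function names `cath_prefix` and `split_by_category`, the sequence IDs, and the annotation dictionary are all hypothetical.

```python
# CATH levels in order, from coarsest to finest.
CATH_LEVELS = ("class", "architecture", "topology", "homologous_superfamily")


def cath_prefix(cath_id: str, level: str) -> str:
    """Truncate a dotted CATH code (e.g. '3.40.50.300') to the given level."""
    depth = CATH_LEVELS.index(level) + 1
    parts = cath_id.split(".")
    if len(parts) < depth:
        raise ValueError(f"{cath_id!r} has no {level} level")
    return ".".join(parts[:depth])


def split_by_category(annotations, target, level):
    """Split sequence IDs into (target category, everything else)."""
    in_target, rest = [], []
    for seq_id, cath_id in annotations.items():
        if cath_prefix(cath_id, level) == target:
            in_target.append(seq_id)
        else:
            rest.append(seq_id)
    return in_target, rest


# Hypothetical annotations; the sequence IDs are placeholders.
annotations = {
    "seq_a": "1.10.8.10",
    "seq_b": "3.40.50.300",
    "seq_c": "3.40.50.720",
    "seq_d": "2.60.40.10",
}

# Coarse target: every domain in CATH class 3.
coarse = split_by_category(annotations, "3", "class")
# Fine target: a single homologous superfamily.
fine = split_by_category(annotations, "3.40.50.300", "homologous_superfamily")
```

The same annotations yield a broad target set at the class level and a narrow one at the superfamily level, which is the kind of coarse-versus-fine contrast the abstract describes.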

## Author summary

Proteins govern cellular processes and their functions arise from the three-dimensional structures encoded by their amino acid sequences. Predicting protein structure from sequence has thus become a central capability of modern biological sequence models. Protein language models, trained on sequence alone with a general language modeling objective, are remarkably accurate at structure prediction and are widely used in protein design and engineering workflows.

However, relatively little is known about how these models’ weights encode relationships between different protein structural features. This direction is increasingly important as protein language models scale in data, compute, and model size. Here, we demonstrate that it is possible to isolate subsets of model weights, i.e., subnetworks, that correspond to specific categories of defined structures. Our results show that the structure-prediction accuracy using protein language models is highly sensitive to these subnetworks, even when changes in language modeling performance are small. When applied across diverse structural categories, our method suggests that structural knowledge is distributed in a way that reflects the continuous spectrum of protein structural diversity. Our work provides insight into how biologically relevant information is organized within protein language model weights and offers a foundation for a more informed and interpretable way to train future models.
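
A "small change in language modeling performance" can be quantified in several ways. The sketch below assumes perplexity, computed as the exponential of the mean per-token negative log-likelihood, as the metric; this summary does not confirm which metric the paper reports, and the numbers are invented placeholders rather than results.

```python
import math


def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_change(before, after):
    """Fractional change from `before` to `after`."""
    return (after - before) / before


# Hypothetical per-token NLLs on the same held-out tokens (not paper data).
full_model_nlls = [2.0, 2.2, 1.8, 2.0]
subnetwork_nlls = [2.1, 2.3, 1.9, 2.1]

ppl_full = perplexity(full_model_nlls)
ppl_sub = perplexity(subnetwork_nlls)
ppl_shift = relative_change(ppl_full, ppl_sub)
```

Reporting a relative shift like `ppl_shift` alongside a structure-prediction metric for the same sequences is one way to check whether a modest language modeling change coincides with a large drop in structural accuracy.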

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12928587/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12928587/full.md

## References

39 references — full list in the complete paper: https://tomesphere.com/paper/PMC12928587/full.md

---
Source: https://tomesphere.com/paper/PMC12928587