# Multi-scale structural similarity embedding search across entire proteomes

**Authors:** Joan Segura, Ruben Sanchez-Garcia, Sebastian Bittrich, Yana Rose, Stephen K Burley, Jose M Duarte

PMC · DOI: 10.1093/bioinformatics/btag058 · Bioinformatics · 2026-02-03

## TL;DR

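This paper introduces a scalable method for searching 3D biomolecular structures. Protein language models and a deep neural network encode each structure as a fixed-length vector, so similarity comparisons across large databases become efficient vector lookups.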

## Contribution

A structure similarity search method that encodes 3D structures as fixed-length vectors using protein language models and a deep neural network, then retrieves similar structures at scale from a vector database.

## Key findings

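- Although trained only to predict TM-scores between single-domain structures, the model generalizes beyond the domain level and identifies 3D similarity in full-length polypeptide chains and multimeric assemblies.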
- Vector databases enable efficient large-scale structure retrieval and comparison.
- The method scales to growing repositories of experimentally determined structures and AI-predicted computed structure models.

## Abstract

The rapid expansion of three-dimensional (3D) biomolecular structure information, driven by breakthroughs in artificial intelligence/deep learning (AI/DL)-based structure predictions, has created an urgent need for scalable and efficient structure similarity search methods. Traditional alignment-based approaches, such as structural superposition tools, are computationally expensive and challenging to scale with the vast number of available macromolecular structures.

Herein, we present a scalable structure similarity search strategy designed to navigate extensive repositories of experimentally determined structures and computed structure models predicted using AI/DL methods. Our approach leverages protein language models and a deep neural network architecture to transform 3D structures into fixed-length vectors, enabling efficient large-scale comparisons. Although trained to predict TM-scores between single-domain structures, our model generalizes beyond the domain level, accurately identifying 3D similarity for full-length polypeptide chains and multimeric assemblies. By integrating vector databases, our method facilitates efficient large-scale structure retrieval, addressing the growing challenges posed by the expanding volume of 3D biostructure information.
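The retrieval step described above, comparing fixed-length vectors instead of superposing structures, can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings here are random stand-ins, the 8-dimensional size is arbitrary, the use of cosine similarity as the comparison score is an assumption, and the helper names `normalize` and `top_k_similar` are hypothetical. A brute-force NumPy scan stands in for the vector database.

```python
import numpy as np

def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def top_k_similar(query: np.ndarray, index: np.ndarray, k: int = 3):
    """Return (row, score) pairs for the k index rows most similar to the query."""
    q = normalize(query.reshape(1, -1))[0]
    scores = normalize(index) @ q          # cosine similarity to every entry
    k = min(k, len(scores))
    best = np.argsort(-scores)[:k]         # highest scores first
    return [(int(i), float(scores[i])) for i in best]

# Toy data: random vectors standing in for structure embeddings (assumption).
rng = np.random.default_rng(0)
index = rng.normal(size=(1000, 8))
# A query that is a slightly perturbed copy of entry 42.
query = index[42] + 0.01 * rng.normal(size=8)

hits = top_k_similar(query, index, k=3)
for row, score in hits:
    print(row, round(score, 4))
```

Because each structure is reduced to one vector ahead of time, a query costs a single matrix-vector product over the index rather than one structural alignment per database entry. A production vector database would typically replace the exhaustive scan with an approximate nearest-neighbor index.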

Source code available at https://github.com/bioinsilico/rcsb-embedding-search. Source code DOI: https://doi.org/10.6084/m9.figshare.30546698.v1. Benchmark datasets DOI: https://doi.org/10.6084/m9.figshare.30546650.v1. Web server prototype available at: http://embedding-search.rcsb.org/.

## Full-text entities

- **Genes:** PDB [NCBI Gene 5131]
- **Diseases:** DL (MESH:C537113), Cancer (MESH:D009369), Diseases (MESH:D004194), CSMs (MESH:C000719218)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12955762/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12955762/full.md

## References

39 references — full list in the complete paper: https://tomesphere.com/paper/PMC12955762/full.md

---
Source: https://tomesphere.com/paper/PMC12955762