# Exploring homology detection via k-means clustering of proteins embedded with a large language model

**Authors:** Thomas Minotto, Antoine Claessens, Thomas D Otto

PMC · DOI: 10.1093/bioinformatics/btaf472 · Bioinformatics · 2025-08-26

## TL;DR

This paper embeds proteins with a biologically oriented large language model and clusters the embeddings with k-means to detect homology. The approach is less sensitive than other tools but more precise at detecting n:m orthologs.

## Contribution

The paper applies k-means clustering to protein embeddings produced by a biologically oriented large language model in order to extract homology relationships, and uses the resulting clusters to reconstruct full orthologous groups from scratch.

## Key findings

- The approach detects n:m orthologs with better precision than other tools, although it lacks their sensitivity.
- Full orthologous groups are successfully reconstructed from scratch using the method.
- Large language models combined with clustering show potential for analyzing protein data.
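
The trade-off between precision and sensitivity mentioned for this approach can be made concrete with a small sketch. It assumes a pairwise scoring scheme, in which two proteins count as a predicted homologous pair when they share a cluster; this summary does not describe the paper's exact evaluation protocol, so the scheme, the function names, and the toy protein identifiers below are all illustrative assumptions.

```python
from itertools import combinations


def pairs_from_groups(groups):
    """Return the set of unordered protein pairs that share a group."""
    pairs = set()
    for members in groups:
        # Sorting makes each pair canonical, so (a, b) and (b, a) coincide.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def precision_and_sensitivity(predicted_groups, reference_groups):
    """Score predicted clusters against reference groups on shared pairs."""
    predicted = pairs_from_groups(predicted_groups)
    reference = pairs_from_groups(reference_groups)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    sensitivity = true_positives / len(reference) if reference else 0.0
    return precision, sensitivity


# Hypothetical toy data: one 1:2 group and one 1:1 group.
reference = [{"hsa_1", "mmu_1", "mmu_2"}, {"hsa_2", "mmu_3"}]
# The prediction misses mmu_2 but makes no wrong pairing.
predicted = [{"hsa_1", "mmu_1"}, {"hsa_2", "mmu_3"}, {"mmu_2"}]

precision, sensitivity = precision_and_sensitivity(predicted, reference)
print(f"precision={precision:.2f} sensitivity={sensitivity:.2f}")
```

In this toy case every predicted pair is correct while half of the reference pairs are missed, which is the general shape of a method that is precise but less sensitive.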

## Abstract

Inferring protein homology from sequence information is essential for understanding species evolution and enabling functional annotation transfer. Besides similarity-based methods, several machine learning approaches have been developed using various ways of representing protein data.

Here, we represent proteins with a biologically oriented large language model and apply k-means clustering to the embedded data to extract homology relationships. Although our approach lacks the sensitivity of other tools, we obtain better precision for the detection of n:m orthologs. Furthermore, we successfully reconstruct full orthologous groups from scratch, highlighting the growing potential of using large language models in combination with clustering algorithms for the analysis of protein data.

Datasets are available on OrthoMCL-DB as indicated in the Methods. Source code is available on GitHub at https://github.com/ThomasGTHB/OrthoLM and Zenodo at https://doi.org/10.5281/zenodo.16640170.
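
The clustering step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation (see the OrthoLM repository for that): the three-dimensional vectors stand in for real language-model embeddings, the protein identifiers are made up, the number of clusters is fixed by hand, and a deterministic farthest-point initialization is used in place of whatever initialization the paper uses.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def farthest_point_init(vectors, k):
    """Deterministic initialization (an assumption made for this sketch)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        # Pick the vector whose nearest existing centroid is farthest away.
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(vectors, k, max_iterations=100):
    """Plain Lloyd's algorithm; returns one cluster label per vector."""
    centroids = farthest_point_init(vectors, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # Assignments are stable, so the algorithm has converged.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical stand-ins for protein embeddings (real ones are much larger).
toy_embeddings = {
    "speciesA|kinase_1": [0.0, 0.0, 0.0],
    "speciesB|kinase_1": [0.1, 0.0, 0.0],
    "speciesB|kinase_2": [0.0, 0.1, 0.0],
    "speciesA|transporter_1": [5.0, 5.0, 5.0],
    "speciesB|transporter_1": [5.1, 5.0, 5.0],
}

names = list(toy_embeddings)
labels = kmeans([toy_embeddings[name] for name in names], k=2)

# Proteins sharing a label form one candidate homology group.
clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)
print(clusters)
```

Here the first cluster pairs one protein from species A with two from species B, the kind of one-to-many grouping that a per-cluster readout can capture directly. In practice the choice of `k` largely determines how finely groups are split, and this summary does not state how the paper sets it.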

## Full-text entities

- **Genes:** NCAN (neurocan) [NCBI Gene 1463] {aka CSPG3}, CESA2 (cellulose synthase A2) [NCBI Gene 830090] {aka ATCESA2, ATH-A, CELLULOSE SYNTHASE, T22F8.250, T22F8_250, cellulose synthase A2}, MSH6 (mutS homolog 6) [NCBI Gene 2956] {aka GTBP, GTMBP, HNPCC5, HSAP, LYNCH5, MMRCS3}
- **Diseases:** malaria (MESH:D008288)
- **Chemicals:** amino acid (MESH:D000596)
- **Species:** Toxoplasma gondii (species) [taxon 5811], Xenopus tropicalis (tropical clawed frog, species) [taxon 8364], Trypanosoma brucei (species) [taxon 5691], Pan troglodytes (chimpanzee, species) [taxon 9598], Cryptosporidium parvum (species) [taxon 5807], Arabidopsis thaliana (mouse-ear cress, species) [taxon 3702], Danio rerio (leopard danio, species) [taxon 7955], Sulfolobus acidocaldarius (species) [taxon 2285], Homo sapiens (human, species) [taxon 9606], Mycobacterium tuberculosis (species) [taxon 1773], Plasmodium berghei (species) [taxon 5821], Cryptosporidium bovis (species) [taxon 310047], Escherichia coli (E. coli, species) [taxon 562], Drosophila melanogaster (fruit fly, species) [taxon 7227], Mus musculus (house mouse, species) [taxon 10090], Plasmodium falciparum (malaria parasite P. falciparum, species) [taxon 5833], Caenorhabditis elegans (species) [taxon 6239], Neospora caninum (species) [taxon 29176], Saccharomyces cerevisiae (baker's yeast, species) [taxon 4932]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12517335/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12517335/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/PMC12517335/full.md

---
Source: https://tomesphere.com/paper/PMC12517335