# OrthoML2GO: homology-based protein function prediction using orthogroups and machine learning

**Authors:** E.V. Malyugin, D.A. Afonnikov

PMC · DOI: 10.18699/vjgb-25-119 · Vavilov Journal of Genetics and Breeding · 2025-12-01

## TL;DR

OrthoML2GO predicts protein function by combining homology search, orthogroup analysis, and machine learning. It performs comparably to existing tools and outperforms them on some evaluation metrics, especially on large and heterogeneous datasets.

## Contribution

Introduces OrthoML2GO, a novel method integrating orthogroup analysis and machine learning for improved protein function prediction.

## Key findings

- OrthoML2GO is comparable to Blast2GO and PANNZER2 in predicting GO terms and outperforms them on some evaluation metrics, especially on large and heterogeneous samples.
- The largest performance gain comes from combining closest-homolog and orthogroup information with machine learning verification of the assigned GO terms.
- The method shows high performance for large-scale automatic protein annotation.

## Abstract

In recent years, the rapid growth of sequencing data has exacerbated the problem of functional annotation of protein sequences, as traditional homology-based methods face limitations when working with distant homologs, making it difficult to accurately determine protein functions. This paper introduces the OrthoML2GO method for protein function prediction, which integrates homology searches using the USEARCH algorithm, orthogroup analysis based on OrthoDB version 12.0, and a machine learning algorithm (gradient boosting). A key feature of our approach is the use of orthogroup information to account for the evolutionary and functional similarity of proteins and the application of machine learning to refine the assigned GO terms for the target sequence. To select the optimal algorithm for protein annotation, the following approaches were applied sequentially: the k-nearest neighbors (KNN) method; a method based on the annotation of the orthogroup most represented in the k-nearest homologs (OG); a method of verifying the GO terms identified in the previous stage using machine learning algorithms. A comparison of the prediction accuracy of GO terms using the OrthoML2GO method with the Blast2GO and PANNZER2 annotation programs was performed on sequence samples from both individual organisms (humans, Arabidopsis) and a combined sample represented by different taxa. Our results demonstrate that the proposed method is comparable to, and by some evaluation metrics outperforms, these existing methods in terms of the quality of protein function prediction, especially on large and heterogeneous samples of organisms. The greatest performance improvement is achieved by combining information about the closest homologs and orthogroups with verification of terms using machine learning methods. 
Our approach demonstrates high performance for large-scale automatic protein annotation, and prospects for further development include optimizing machine learning model parameters for specific biological tasks and integrating additional sources of structural and functional information, which will further improve the method’s accuracy and versatility. In addition, the introduction of new bioinformatics tools and the expansion of the annotated protein database will contribute to the further improvement of the proposed approach.
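
The first two stages named in the abstract (KNN scoring and the orthogroup most represented among the k nearest homologs) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Hit` record, the `identity` score, the `og_annotations` mapping, and all function names are hypothetical, and the search step (USEARCH in the paper) and the gradient boosting verification step are not shown.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One homolog from a similarity search (hypothetical record layout)."""
    protein_id: str
    identity: float       # similarity score; higher means closer
    orthogroup: str       # orthogroup ID of the homolog
    go_terms: frozenset   # GO terms annotated to the homolog


def top_k(hits, k):
    """Return the k hits with the highest similarity score."""
    return sorted(hits, key=lambda h: h.identity, reverse=True)[:k]


def knn_scores(hits, k):
    """KNN stage: score each GO term by the fraction of the
    k nearest homologs that carry it."""
    nearest = top_k(hits, k)
    if not nearest:
        return {}
    counts = Counter(term for h in nearest for term in h.go_terms)
    return {term: n / len(nearest) for term, n in counts.items()}


def dominant_orthogroup(hits, k):
    """OG stage: the orthogroup most represented among the k nearest homologs."""
    nearest = top_k(hits, k)
    if not nearest:
        return None
    return Counter(h.orthogroup for h in nearest).most_common(1)[0][0]


def og_candidates(hits, k, og_annotations):
    """Candidate GO terms taken from the dominant orthogroup's annotation.
    og_annotations is an assumed mapping: orthogroup ID -> set of GO terms."""
    og = dominant_orthogroup(hits, k)
    return set(og_annotations.get(og, ()))


hits = [
    Hit("p1", 0.9, "OG1", frozenset({"GO:0003677"})),
    Hit("p2", 0.8, "OG1", frozenset({"GO:0003677", "GO:0005634"})),
    Hit("p3", 0.7, "OG2", frozenset({"GO:0016301"})),
    Hit("p4", 0.1, "OG2", frozenset({"GO:0016301"})),
]
og_annotations = {"OG1": {"GO:0003677", "GO:0005634"}}

scores = knn_scores(hits, k=3)
candidates = og_candidates(hits, k=3, og_annotations=og_annotations)
```

In this toy input, two of the three nearest homologs belong to `OG1`, so the candidate terms come from that orthogroup's annotation. In the pipeline the abstract describes, such candidates would then be passed to a trained classifier for verification. How the features for that classifier are built is not stated in this summary.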

## Linked entities

- **Species:** Homo sapiens (taxon 9606), Arabidopsis (taxon 3701)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12799360/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12799360/full.md

---
Source: https://tomesphere.com/paper/PMC12799360