# Machine learning can distinguish orphans that have resulted from sequence divergence beyond recognition

**Authors:** Emilios Tassios, Jori de Leuw, Christoforos Nikolaou, Anne Kupczok, Nikolaos Vakirlis

PMC · DOI: 10.1093/bioadv/vbaf324 · Bioinformatics Advances · 2025-12-27

## TL;DR

This paper shows that machine learning classifiers trained on patterns of non-significant similarity hits can identify orphan genes whose homologues have diverged beyond the reach of standard sequence similarity searches.

## Contribution

A machine learning approach that distinguishes divergent orphans from de novo orphans using features extracted from similarity search output, including hits that are not statistically significant.

## Key findings

- On simulated data, machine learning models achieved ∼90% accuracy when divergence was moderate and ∼70% when it was extreme.
- About 30% of real orphans were predicted to be divergent, and these genes were shorter and more disordered than the remaining orphans.
- Non-statistically significant similarity hits can be informative for identifying divergent orphans.
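To illustrate the last finding, here is a minimal sketch of how weak hits might be summarised into features. It assumes similarity search results in the default 12-column BLAST+ tabular layout (`-outfmt 6`). The e-value cutoff of `1e-3` and the four feature names are hypothetical choices for illustration; this summary does not list the features the authors actually used.

```python
from statistics import mean

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hits(lines):
    """Parse tab-separated hit lines into dicts, converting the fields used below."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, line.split("\t")))
        for key in ("pident", "evalue", "bitscore"):
            row[key] = float(row[key])
        row["length"] = int(row["length"])
        hits.append(row)
    return hits


def weak_hit_features(hits, evalue_cutoff=1e-3):
    """Summarise only the hits above the cutoff, i.e. those usually discarded.

    The cutoff and the feature set are illustrative assumptions.
    """
    weak = [h for h in hits if h["evalue"] > evalue_cutoff]
    if not weak:
        return {"n_weak": 0, "max_bitscore": 0.0,
                "mean_pident": 0.0, "mean_length": 0.0}
    return {
        "n_weak": len(weak),
        "max_bitscore": max(h["bitscore"] for h in weak),
        "mean_pident": mean(h["pident"] for h in weak),
        "mean_length": mean(h["length"] for h in weak),
    }


# Made-up hit lines for one query: two weak hits and one significant hit.
example = [
    "orphan1\tprotA\t28.0\t50\t36\t0\t1\t50\t10\t59\t2.5\t25.0",
    "orphan1\tprotB\t32.0\t40\t27\t0\t5\t44\t3\t42\t0.8\t27.0",
    "orphan1\tprotC\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t180.0",
]
features = weak_hit_features(parse_hits(example))
print(features)
```

A feature dictionary like this, computed per query protein, could serve as one row of the input table for a classifier.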

## Abstract

Species-specific orphan genes lack homologues outside of a given taxon and frequently underlie unique species traits. Orphans can result from sequence divergence beyond recognition, when homologous proteins diverge to an extent at which sequence similarity search algorithms can no longer identify them as homologues, but they can also evolve de novo from previously noncoding sequences, in which case homologous protein-coding genes truly do not exist.

Here we propose that sequence-divergent orphans might be recognizable from their patterns of non-statistically significant similarity hits, which are typically discarded. To test this, we simulated diverged orphan protein sequences under varying parameters. Using reversed protein sequences as a negative control, we trained machine learning classifiers on features extracted from similarity search output. We found that this approach works, but the performance of the models depends on the simulation parameters, with ∼90% accuracy when the underlying simulated divergence was moderate and ∼70% when it was extreme. When applying our classifiers to a set of real orphans, we found that ∼30% of them are predicted to be divergent and that these are shorter and more disordered than the rest. Our work contributes to the effort of better understanding how genetic novelty arises.
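The negative-control idea can be sketched in a few lines. Reversing a protein sequence preserves its length and amino acid composition while scrambling any real homology signal, so a reversed sequence is a plausible stand-in for a protein with no true homologues. The sketch below pairs each simulated divergent sequence with its own reversal. That pairing, the function names, and the `_rev` suffix are assumptions for illustration; the abstract does not state which sequences were reversed.

```python
def reverse_protein(seq: str) -> str:
    """Return the reversed amino acid sequence (same length, same composition)."""
    return seq.strip().upper()[::-1]


def build_labelled_set(positives: dict[str, str]) -> list[tuple[str, str, int]]:
    """Pair each simulated divergent sequence (label 1) with its reversal (label 0).

    `positives` maps a sequence ID to a protein sequence. Pairing each
    sequence with its own reversal is an illustrative assumption.
    """
    records = []
    for seq_id, seq in positives.items():
        records.append((seq_id, seq.strip().upper(), 1))
        records.append((f"{seq_id}_rev", reverse_protein(seq), 0))
    return records


# Toy sequences standing in for simulated divergent orphans.
simulated = {"sim1": "MKTAYIAK", "sim2": "MSLLNQ"}
records = build_labelled_set(simulated)
for seq_id, seq, label in records:
    print(seq_id, seq, label)
```

Each sequence in the labelled set would then be run through the similarity search, and features extracted from the search output would be passed to a classifier together with the labels.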

The models and data used can be found at https://github.com/emiliostassios/Classification-of-divergent-genes-using-ML

## Full-text entities

- **Species:** Saccharomyces cerevisiae (baker's yeast, species) [taxon 4932], Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12904771/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12904771/full.md

## References

57 references — full list in the complete paper: https://tomesphere.com/paper/PMC12904771/full.md

---
Source: https://tomesphere.com/paper/PMC12904771