# Eukan: a fully automated nuclear genome annotation pipeline for less studied and divergent eukaryotes

**Authors:** Matt Sarrasin, Gertraud Burger, B Franz Lang

PMC · DOI: 10.1093/nargab/lqag003 · NAR Genomics and Bioinformatics · 2026-01-20

## TL;DR

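Eukan is a fully automated pipeline for annotating nuclear genomes of diverse eukaryotes. It uses RNA-Seq coverage and intron lengths to refine predictions, builds a weighted consensus from multiple gene prediction sources, and recovers transcript-supported genes that the consensus missed.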

## Contribution

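The paper introduces Eukan, an annotation pipeline benchmarked against Maker, Braker, and Gemoma on 17 phylogenetically diverse nuclear genomes. Eukan performs consistently well where the other pipelines encounter challenges, such as compact protist genomes.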

## Key findings

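- Eukan performs consistently well on compact protist genomes, where the other tested pipelines (Maker, Braker, and Gemoma) encounter challenges.
- A post-annotation routine recovers gene predictions that are missing from the consensus but have strong transcript support and appear to be protein-coding.
- A novel classification system for critical defects commonly observed in automated annotations is introduced.
- Every tested pipeline correctly identifies the majority of validated "gold standard" genes, yet each also produces a non-negligible share of fragmented, artificially fused, or missing genes.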
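To make the defect categories concrete, here is a minimal Python sketch that labels each gold-standard gene as fragmented, artificially fused, or missing by comparing coordinates with predicted genes. This is an illustration only, not the paper's classification system: the function names, the half-open `(start, end)` interval convention, and the "any overlap counts" rule are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical convention: half-open (start, end) coordinates on one strand of one contig.
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def classify_defects(gold: List[Interval], predicted: List[Interval]) -> Dict[Interval, str]:
    """Assign one label per gold-standard gene (assumed rules, not the paper's)."""
    labels: Dict[Interval, str] = {}
    for g in gold:
        hits = [p for p in predicted if overlaps(g, p)]
        if not hits:
            # No prediction touches this gene at all.
            labels[g] = "missing"
        elif len(hits) > 1:
            # Several separate predictions cover one true gene.
            labels[g] = "fragmented"
        elif sum(overlaps(hits[0], other) for other in gold) > 1:
            # The single matching prediction also spans another true gene.
            labels[g] = "fused"
        else:
            labels[g] = "ok"
    return labels


if __name__ == "__main__":
    gold = [(100, 500), (600, 900), (1000, 1500), (2000, 2500), (3000, 3500)]
    predicted = [(100, 500), (600, 1500), (2000, 2200), (2300, 2500)]
    for gene, label in classify_defects(gold, predicted).items():
        print(gene, label)
```

A gene that is both split and merged with a neighbour is reported as fragmented here, because that check runs first. A real evaluation would also need to compare exon structure and strand, not just gene spans.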

## Abstract

Here, we introduce a new annotation pipeline, called Eukan, designed to deliver reliably high-quality results across a broad range of eukaryotes. First, experimental evidence is automatically leveraged to refine predictions, specifically, RNA-Seq coverage to inform generalized Hidden Markov Model gene prediction and intron lengths to inform protein sequence alignments. Second, a consensus is created from an empirically optimized weighting of gene predictions from multiple sources. Third, Eukan runs a post-annotation routine to recover gene predictions missing from the consensus that otherwise have strong transcript support and appear to be protein-coding. We compare the results of Eukan with those of three popular freely available pipelines (Maker, Braker, and Gemoma) on 17 phylogenetically diverse haploid and diploid nuclear genomes. In addition to the commonly reported annotation accuracy statistics, we define a novel classification system of critical defects commonly observed in automated annotations. Furthermore, we demonstrate that each of the tested pipelines correctly identified the majority of the validated “gold standard” genes across the test set, but each pipeline uniquely generates a non-negligible portion of either fragmented, artificially fused, or missing genes. Despite that, Eukan performs consistently well where other pipelines encounter challenges, such as for compact protist genomes.
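The second step of the abstract, a consensus built from weighted gene predictions, can be illustrated with a simplified per-base weighted vote. The sketch below is not Eukan's implementation: the source names, the weights, the `min_fraction` threshold, and the per-base voting scheme are hypothetical choices made for illustration, whereas the paper describes its weights as empirically optimized.

```python
from typing import Dict, List, Tuple

# Hypothetical convention: half-open, 0-based (start, end) intervals on one sequence.
Interval = Tuple[int, int]


def weighted_consensus(
    predictions: Dict[str, List[Interval]],
    weights: Dict[str, float],
    length: int,
    min_fraction: float = 0.5,
) -> List[Interval]:
    """Keep bases whose summed source weight reaches min_fraction of the total weight."""
    score = [0.0] * length
    for source, intervals in predictions.items():
        w = weights[source]
        for start, end in intervals:
            for pos in range(start, end):
                score[pos] += w

    threshold = min_fraction * sum(weights[s] for s in predictions)

    # Merge consecutive supported bases back into intervals.
    consensus: List[Interval] = []
    run_start = None
    for pos in range(length):
        if score[pos] >= threshold:
            if run_start is None:
                run_start = pos
        elif run_start is not None:
            consensus.append((run_start, pos))
            run_start = None
    if run_start is not None:
        consensus.append((run_start, length))
    return consensus


if __name__ == "__main__":
    # Illustrative inputs only; the source names and weights are made up.
    predictions = {
        "ab_initio": [(10, 50)],
        "protein_alignment": [(20, 60)],
        "rnaseq": [(20, 50), (80, 90)],
    }
    weights = {"ab_initio": 1.0, "protein_alignment": 2.0, "rnaseq": 3.0}
    print(weighted_consensus(predictions, weights, length=100))
```

In this toy setup a region supported only by the highest-weighted source can still enter the consensus, while a region backed only by low-weighted sources is dropped. That is the general effect of weighting sources unequally, whatever the exact combination method.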

## Full-text entities

- **Genes:** RNaseP:RNA (Ribonuclease P RNA) [NCBI Gene 3772418] {aka CR32868, Dm RPR, Dmel\CR32868, P RNA, RNAseP, RNase P RNA}
- **Chemicals:** diplonemids (-)
- **Species:** Mycosarcoma maydis (corn smut, species) [taxon 5270], Xenopus tropicalis (tropical clawed frog, species) [taxon 8364], C. elegans [taxon 328850], Toxoplasma gondii (species) [taxon 5811], Chlamydomonas reinhardtii (species) [taxon 3055], Aspergillus nidulans (species) [taxon 162425], Schizosaccharomyces pombe (fission yeast, species) [taxon 4896], Caenorhabditis elegans (species) [taxon 6239], Verticillium dahliae (species) [taxon 27337], Saccharomyces cerevisiae (baker's yeast, species) [taxon 4932], Cyanidioschyzon merolae (species) [taxon 45157], Thalassiosira pseudonana (species) [taxon 35128], Plicaturopsis crispa [taxon 139390], Medicago truncatula (barrel medic, species) [taxon 3880], Oryza sativa (Asian cultivated rice, species) [taxon 4530], Blastocystis (genus) [taxon 12967], Symbiodinium kawagutii (species) [taxon 104179], Ostreococcus sp. 'lucimarinus' (species) [taxon 242159], Drosophila melanogaster (fruit fly, species) [taxon 7227], Neurospora crassa (species) [taxon 5141], Dictyostelium discoideum (species) [taxon 44689], Bombus terrestris (buff-tailed bumblebee, species) [taxon 30195], Diplonema papillatum (species) [taxon 91374], Solanum lycopersicum (tomato, species) [taxon 4081], Caenorhabditis briggsae (species) [taxon 6238], Trypanosoma brucei (species) [taxon 5691], Danio rerio (leopard danio, species) [taxon 7955], Arabidopsis thaliana (mouse-ear cress, species) [taxon 3702], Chloropicon primus (species) [taxon 1764295], Homo sapiens (human, species) [taxon 9606], Leishmania major (species) [taxon 5664], Plasmodium falciparum (malaria parasite P. falciparum, species) [taxon 5833]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12817076/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12817076/full.md

## References

73 references — full list in the complete paper: https://tomesphere.com/paper/PMC12817076/full.md

---
Source: https://tomesphere.com/paper/PMC12817076