# Widespread gene fusion artifacts in helminth genome annotations

**Authors:** Emma L. Collington, Andrew C. Doxey, Brendan J. McConkey, D. Moira Glerum

PMC · DOI: 10.1186/s12864-026-12589-y · 2026-02-04

## TL;DR

RNA-seq analysis across 20 helminth species indicates that most helminth-specific predicted fusion genes are likely gene prediction artifacts rather than genuine fusions.

## Contribution

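The study compares 10,297 helminth-specific two-domain proteins against 400 high-confidence, evolutionarily conserved domain fusions and concludes that most of the helminth-specific fusions are likely gene prediction artifacts, not true fusions.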

## Key findings

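- Helminth-specific fusion genes show a lack of correlation between the RNA-seq expression levels of the first domain, the second domain, and the interdomain region.
- Compared with conserved fusions, they have significantly longer interdomain regions and significantly less continuous RNA-seq coverage across those regions.
- They are generally not supported in de novo transcriptome assemblies, which points to annotation errors.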

## Abstract

Current helminth genomes possess thousands of predicted fusion genes, encoding novel protein domain architectures that are unique to these species. To test whether these predictions are genuine, we analyzed 20,313 two-domain proteins annotated in current helminth genomes, of which 10,297 are apparently unique to helminths, and used RNA-seq data from 20 species of helminth to examine their plausibility as true fusion genes. For comparison, we analyzed a set of 400 high confidence, evolutionarily conserved domain fusions that are present in both helminth and non-helminth species.

Our analysis suggests that, in contrast to genuine fusion genes, the majority of helminth-specific fusion genes in the 20 species investigated are likely gene prediction artifacts based on several criteria: (1) they show a lack of correlation between RNA-seq derived expression levels of the first and second “fused” domains, as well as the interdomain region; (2) they have significantly longer interdomain regions; (3) there is significantly less continuity of coverage in their interdomain regions consistent with breakpoints in RNA-seq coverage; and (4) they are generally not supported in de novo transcriptome assemblies.
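Criteria (1) and (3) above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the use of Pearson correlation, the `min_depth` cutoff, and the toy per-base depth values are all assumptions made for illustration. The idea is that a genuine fusion should show domain expression levels that rise and fall together across samples, with reads spanning the interdomain region, whereas a mis-merged pair of genes should not.

```python
from math import sqrt

def mean_depth(coverage, start, end):
    """Mean per-base read depth over the half-open interval [start, end)."""
    region = coverage[start:end]
    return sum(region) / len(region)

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)

def covered_fraction(coverage, start, end, min_depth=1):
    """Fraction of bases in [start, end) with depth >= min_depth (hypothetical cutoff)."""
    region = coverage[start:end]
    return sum(1 for d in region if d >= min_depth) / len(region)

# Hypothetical coordinates of one predicted two-domain gene (toy values).
DOMAIN1, INTERDOMAIN, DOMAIN2 = (0, 4), (4, 8), (8, 12)

# Toy per-base depth tracks for three RNA-seq samples (made-up numbers).
samples = [
    [10, 10, 10, 10, 0, 0, 0, 0, 5, 5, 5, 5],
    [20, 20, 20, 20, 0, 0, 1, 0, 1, 1, 1, 1],
    [30, 30, 30, 30, 0, 0, 0, 0, 3, 3, 3, 3],
]

# Criterion (1): correlation of domain expression levels across samples.
d1_levels = [mean_depth(s, *DOMAIN1) for s in samples]
d2_levels = [mean_depth(s, *DOMAIN2) for s in samples]
domain_correlation = pearson(d1_levels, d2_levels)

# Criterion (3): continuity of coverage across the interdomain region.
interdomain_continuity = [covered_fraction(s, *INTERDOMAIN) for s in samples]

print("domain expression correlation:", domain_correlation)
print("interdomain covered fraction per sample:", interdomain_continuity)
```

In this toy case the two domains are not positively correlated and the interdomain region is almost entirely uncovered, which is the pattern the abstract associates with artifactual fusions. Real data would need many more samples and a comparison against the conserved fusion set before drawing any conclusion.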

Proteins containing novel domain combinations have been included in widely used sequence and protein databases, including WormBase ParaSite and InterPro, but the analyses presented here suggest that many helminth-specific domain fusion proteins are erroneously annotated. These findings emphasize the importance of using RNA-seq data to validate gene predictions in helminth genomes, especially those with unique structures not observed in other species. Given the increasing need to accurately identify helminth-specific proteins as therapeutic targets, the accuracy of proteome annotation in widely used genomic databases is essential.

The online version contains supplementary material available at 10.1186/s12864-026-12589-y.

## Full-text entities

- **Genes:** cox-17 (Cytochrome c oxidase copper chaperone) [NCBI Gene 175186], cox-5B (COX5B-domain-containing protein) [NCBI Gene 172832], cox-15 (EOG090X04TT) [NCBI Gene 174712], cox-11 (Cytochrome c oxidase assembly protein COX11, mitochondrial) [NCBI Gene 186814], sco-1 (Thioredoxin domain-containing protein) [NCBI Gene 173763], cox-10 (Protoheme IX farnesyltransferase, mitochondrial) [NCBI Gene 174899]
- **Diseases:** parasitic helminths (MESH:D010272), Infections (MESH:D007239), inherited mitochondrial disease (MESH:D030342)
- **Chemicals:** steroid (MESH:D013256), sphingolipid (MESH:D013107), ether lipid (-), glucuronate (MESH:D020723), unsaturated fatty acid (MESH:D005231), amino acid (MESH:D000596), fatty acid (MESH:D005227), carbohydrate (MESH:D002241), alpha-linolenic acid (MESH:D017962), arachidonic acid (MESH:D016718), copper (MESH:D003300), pyruvate (MESH:D019289), glycerophospholipid (MESH:D020404), pentose phosphate (MESH:D010428), tricarboxylic acid (MESH:D014233)
- **Species:** Trichinella britovi (species) [taxon 45882], Trichinella spiralis (species) [taxon 6334], Homo sapiens (human, species) [taxon 9606], C. elegans [taxon 328850], Echinococcus multilocularis (species) [taxon 6211], Caenorhabditis elegans (species) [taxon 6239]

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12964723/full.md

---
Source: https://tomesphere.com/paper/PMC12964723