# Uncovering hundreds of exogenous and endogenous RNA viral RdRp sequences amongst uncharacterized sequences in public protein databases

**Authors:** Katherine Brown, Andrew Edwin Firth

PMC · DOI: 10.1093/ve/veaf074 · Virus Evolution · 2025-09-18

## TL;DR

This study screens proteins labelled as uncharacterized in public protein databases and finds that many derive from RNA viruses or endogenous viral elements (EVEs), revealing new viruses and highlighting annotation problems in the databases.

## Contribution

A screen of uncharacterized proteins in NCBI Protein and UniProt for sequences likely to be derived from RNA viral RdRp, uncovering hundreds of exogenous and endogenous viral sequences and providing sets of negative control sequences.

## Key findings

- 3560 uncharacterized sequences were identified as likely RNA viral RdRp, many from endogenous viral elements.
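- A group of orbi-like viruses infecting nematodes, with both ancient endogenous and circulating exogenous members, was uncovered, along with many integrations of mito-like viruses into plant genomes.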
- Some sequences indistinguishable from viruses are labelled as bacterial, seemingly as a result of mislabelling or contamination.

## Abstract

Public databases of protein sequences, such as the National Center for Biotechnology Information (NCBI) Protein repository and UniProt, contain millions of proteins identified in samples from specific species but named as uncharacterized or hypothetical due to a lack of information about their function. Many such sequences are actually derived from RNA viruses, either due to viral infection of the original sample, contamination, or endogenous viral elements (EVEs) integrated into the genome of the sample species. Many proteins from RNA virus discovery research are also deposited into these repositories but are labelled as uncharacterized and only classified taxonomically at a superkingdom or realm level. Sequences from protein repositories not labelled specifically as being derived from the RNA-viral RNA-dependent RNA polymerase (RdRp) protein are often used as negative controls when looking to identify viral RdRp sequences, so the presence of unlabelled viruses amongst these datasets is problematic. These sequences also represent a source of information about novel viruses and EVEs. In this study, we screened uncharacterized proteins from two large public protein repositories—NCBI Protein and UniProt—to identify sequences likely to be derived from RNA viral RdRp and to perform detailed characterization of sequences of interest. We identified 3560 such sequences, many derived from EVEs. Many are previously unknown EVEs, which led to characterization of additional, related sequences. For example, a group of orbi-like viruses infecting nematodes was uncovered that appears to have both ancient endogenous and circulating exogenous members. Many integrations of mito-like viruses into plant genomes were also found. In several host taxonomic groups, the first example of an EVE, and in some cases the first example of any RNA virus, was uncovered. 
The large number of EVEs uncovered by this relatively small-scale search suggests that only a fraction of the true diversity of EVEs is currently known. We also provide provisional taxonomic annotations for RdRps, currently only listed as members of the Riboviria realm. A number of sequences are identified that are indistinguishable from viruses but are labelled as bacteria, seemingly as a result of mislabelling or contamination. Non-RdRp sequences that share near-significant similarity with RdRp are also characterized. Finally, recommendations are made for generating useful negative controls and sets of negative control sequences are provided.
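As a rough illustration of the first step of such a screen, the sketch below selects records whose names contain "uncharacterized" or "hypothetical" from FASTA-formatted text. It is not the authors' pipeline: the keyword list, the function names (`parse_fasta`, `select_uncharacterized`), the `min_length` parameter, and the toy records are all assumptions made for this example, and the paper's actual selection criteria are not given in this summary.

```python
import re
from typing import Iterator, List, Tuple

# Assumed keywords, taken from the abstract's wording ("uncharacterized or
# hypothetical"); the paper's exact criteria may differ.
UNCHARACTERIZED = re.compile(r"\b(uncharacteri[sz]ed|hypothetical)\b", re.IGNORECASE)


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_uncharacterized(text: str, min_length: int = 0) -> List[Tuple[str, str]]:
    """Keep records whose header matches the keyword pattern and whose
    sequence has at least min_length residues (hypothetical filter)."""
    return [
        (header, seq)
        for header, seq in parse_fasta(text)
        if UNCHARACTERIZED.search(header) and len(seq) >= min_length
    ]


# Toy records with placeholder sequences, not real database entries.
EXAMPLE = """>seq1 uncharacterized protein [Brugia malayi]
MKTLLVA
GGH
>seq2 RNA-dependent RNA polymerase [Orbivirus]
MSSQQ
>seq3 Hypothetical protein [Arabidopsis thaliana]
MA
"""

if __name__ == "__main__":
    for header, seq in select_uncharacterized(EXAMPLE):
        print(header, len(seq))
```

Records selected this way would still need a homology search against known RdRp sequences before any of them could be called viral; that step is outside the scope of this sketch.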

## Linked entities

- **Proteins:** RNA-dependent RNA polymerase (RNA-dependent RNA polymerase), RdRP (RNA-directed RNA polymerase)
- **Species:** Nematodes (taxon 333870)

## Full-text entities

- **Genes:** riboflavin kinase [NCBI Gene 6096703]
- **Diseases:** hepatitis C (MESH:D019698), infection (MESH:D007239), influenza A (MESH:D007251), EVEs (MESH:D014777), fungal (MESH:D009181), filarial (MESH:D004605), filarial nematode (MESH:D009349), lymphatic filarial disease (MESH:D008206), dengue (MESH:D003715)
- **Chemicals:** nitrogen (MESH:D009584), EVEs (-), nucleotide (MESH:D009711), Lipid (MESH:D008055)
- **Species:** Canis lupus familiaris (dog, subspecies) [taxon 9615], Cardamine (bittercress, genus) [taxon 50460], Homo sapiens (human, species) [taxon 9606], Wuchereria bancrofti (agent of lymphatic filariasis, species) [taxon 6293], Felis catus (cat, species) [taxon 9685], Euphydryas editha (Edith's checkerspot, species) [taxon 104508], Isopteran arli-related virus OKIAV103 (no rank) [taxon 2746356], Onchocerca ochengi (species) [taxon 42157], Curvularia thermal tolerance virus (no rank) [taxon 421976], Barbarea (winter cress, genus) [taxon 50457], Brugia malayi (agent of lymphatic filariasis, species) [taxon 6279], Arabidopsis thaliana (mouse-ear cress, species) [taxon 3702], Escherichia coli (E. coli, species) [taxon 562], Hepatitis C virus [taxon 11103], Cicer arietinum (chickpea, species) [taxon 3827], Perkinsozoa (phylum) [taxon 2497438], Methanobacterium (genus) [taxon 2160], Brugia (genus) [taxon 6278], Nematoda (nematode, phylum) [taxon 6231], Getah virus (no rank) [taxon 59300], Filarioidea (superfamily) [taxon 6295], Perkinsus olseni (species) [taxon 32597], Streptococcus (genus) [taxon 1301], Raphanus (genus) [taxon 3725], Duamitovirus oxru1 (species) [taxon 2955798], Haemonchus contortus (barber pole worm, species) [taxon 6289], Diaphorina citri (Asian citrus psyllid, species) [taxon 121845], Alphainfluenzavirus (genus) [taxon 197911], Bacillus (genus) [taxon 55087], Cytorhabdovirus [taxon 11305], Bacillus sp. T (species) [taxon 1071724], Barbus barbus (barbel, species) [taxon 40830], Oxybasis rubra mitovirus 1 (no rank) [taxon 2080462], Orbivirus (genus) [taxon 10892], Nodaviridae (family) [taxon 12283], Perkinsus chesapeaki (species) [taxon 330153], Pomphorhynchus laevis (species) [taxon 141832], Copasivirus ivindoense (species) [taxon 2955720], Klebsiella (genus) [taxon 570], Influenza A virus (no rank) [taxon 11320], Crucihimalaya (genus) [taxon 97990], Azolla filiculoides (species) [taxon 84609], Mischocyttarus mexicanus (species) [taxon 91405], Brugia pahangi (species) [taxon 6280], Hubei lepidoptera virus 4 (species) [taxon 1922906], Rotavirus A (no rank) [taxon 28875], Picornavirales (order) [taxon 464095], Bos taurus (bovine, species) [taxon 9913], Brassica (genus) [taxon 3705], Brugia timori (species) [taxon 42155], Potato virus X (no rank) [taxon 12183], Polistes fuscatus (common paper wasp, species) [taxon 30207]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12548735/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12548735/full.md

## References

83 references — full list in the complete paper: https://tomesphere.com/paper/PMC12548735/full.md

---
Source: https://tomesphere.com/paper/PMC12548735