# CryoVirusDB: An Annotated Dataset for AI-Based Virus Particle Identification in Cryo-EM Micrographs

**Authors:** Rajan Gyawali, Ashwin Dhakal, Liguo Wang, Jianlin Cheng

PMC · DOI: 10.3390/v18020224 · Viruses · 2026-02-11

## TL;DR

CryoVirusDB is a dataset of expert-labeled virus particle coordinates in cryo-EM micrographs, built to train and test AI-based particle identification for 3D structure modeling of viruses.

## Contribution

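The paper introduces CryoVirusDB, a dataset of expert-picked virus particle coordinates in cryo-EM micrographs. It addresses the lack of high-quality, manually labeled data for training and testing AI-based virus particle identification.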

## Key findings

- CryoVirusDB contains 9,941 micrographs from nine datasets, with coordinates for 339,398 labeled virus particles.
- The dataset includes seven non-enveloped viruses with icosahedral or pseudo-icosahedral symmetry.
- The dataset is intended for training and testing AI methods, including deep learning, that identify virus particles in cryo-EM micrographs.
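
To give a rough sense of scale, the totals above can be turned into averages. The short sketch below uses only the counts reported in this summary (including the nine source datasets mentioned in the abstract). The helper name `mean_per` is illustrative, and the results are plain averages: the summary does not state how particles or micrographs are distributed across individual datasets.

```python
# Counts as reported in this summary; nothing here comes from the data files.
NUM_MICROGRAPHS = 9_941
NUM_PARTICLES = 339_398
NUM_DATASETS = 9


def mean_per(total: int, groups: int) -> float:
    """Return the arithmetic mean of `total` items spread over `groups`."""
    return total / groups


particles_per_micrograph = mean_per(NUM_PARTICLES, NUM_MICROGRAPHS)
micrographs_per_dataset = mean_per(NUM_MICROGRAPHS, NUM_DATASETS)

print(f"Mean labeled particles per micrograph: {particles_per_micrograph:.1f}")
print(f"Mean micrographs per source dataset: {micrographs_per_dataset:.1f}")
```

These are dataset-wide averages only; individual micrographs may hold far more or far fewer particles.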

## Abstract

With the advancements in instrumentation, image processing algorithms, and computational capabilities, single-particle cryo-electron microscopy (cryo-EM) has achieved atomic resolution in determining the 3D structures of viruses. The virus structures play a crucial role in studying their biological function and advancing the development of antiviral vaccines and treatments. Despite the effectiveness of artificial intelligence (AI) in general image processing, its development for identifying and extracting virus particles from cryo-EM micrographs has been hindered by the lack of manually labeled high-quality datasets. To fill the gap, we introduce CryoVirusDB, a labeled dataset containing the coordinates of expert-picked virus particles in cryo-EM micrographs. CryoVirusDB comprises 9941 micrographs from nine datasets representing seven distinct non-enveloped viruses exhibiting icosahedral or pseudo-icosahedral symmetry, along with coordinates of 339,398 labeled virus particles. It can be used to train and test AI and machine learning (e.g., deep learning) methods to accurately identify virus particles in cryo-EM micrographs for building atomic 3D structural models for viruses.
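
As an illustration of how labeled coordinates could be used to test a particle picker, the sketch below scores predicted particle centers against expert-picked centers for a single micrograph. It is a minimal example under stated assumptions, not the authors' evaluation protocol: the coordinates are assumed to be `(x, y)` pixel positions of particle centers, the function name `match_particles`, the greedy nearest-neighbor matching, and the distance threshold `max_dist` are all hypothetical choices, and the sample coordinates are made up for demonstration.

```python
from math import hypot


def match_particles(predicted, labeled, max_dist):
    """Greedily match predicted centers to labeled centers.

    A prediction counts as a true positive if an unmatched labeled center
    lies within `max_dist` pixels; each labeled center is used at most once.
    Returns (precision, recall).
    """
    unmatched = list(labeled)
    true_pos = 0
    for px, py in predicted:
        best_index, best_dist = None, max_dist
        for i, (lx, ly) in enumerate(unmatched):
            dist = hypot(px - lx, py - ly)
            if dist <= best_dist:
                best_index, best_dist = i, dist
        if best_index is not None:
            unmatched.pop(best_index)
            true_pos += 1
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(labeled) if labeled else 0.0
    return precision, recall


# Made-up coordinates for one micrograph (assumption: pixel units).
labeled_centers = [(100, 100), (300, 300), (500, 120)]
predicted_centers = [(103, 98), (296, 305), (700, 700)]

precision, recall = match_particles(predicted_centers, labeled_centers, max_dist=20)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Greedy matching is simple but order-dependent. An optimal assignment (for example, the Hungarian algorithm) would be a reasonable alternative when predictions are densely packed, and a threshold tied to the particle diameter is likely more meaningful than a fixed pixel value.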

## Full-text entities

- **Genes:** S (surface glycoprotein) [NCBI Gene 43740568] {aka spike glycoprotein}
- **Diseases:** ice (MESH:C535741), injury to (MESH:D014947), astigmatism (MESH:D001251), COVID-19 (MESH:D000086382)
- **Chemicals:** carbon (MESH:D002244), ice (MESH:D007053), Thon (-)
- **Species:** Human immunodeficiency virus 1 (no rank) [taxon 11676], Cowpea mosaic virus (no rank) [taxon 12264], Macrobrachium rosenbergii nodavirus (species) [taxon 222557], Severe acute respiratory syndrome coronavirus 2 (no rank) [taxon 2697049], Ebola virus (no rank) [taxon 1570291], Human parechovirus 3 (no rank) [taxon 195055], Coxsackievirus B4 (no rank) [taxon 12073], Enterovirus C (no rank) [taxon 138950], Homo sapiens (human, species) [taxon 9606], Nudaurelia capensis omega virus (no rank) [taxon 12541], Feline calicivirus (no rank) [taxon 11978], Viruses (acellular root) [taxon 10239], Enterovirus E (no rank) [taxon 12064]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12945220/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12945220/full.md

## References

40 references — full list in the complete paper: https://tomesphere.com/paper/PMC12945220/full.md

---
Source: https://tomesphere.com/paper/PMC12945220