# Deletion detection in SARS-CoV-2 genomes from COVID-19 patients: elimination of false positives

**Authors:** Nan Jiang, Colin N Dewey, John Yin

PMC · DOI: 10.1093/ve/veag003 · Virus Evolution · 2026-02-02

## TL;DR

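Multiplex-PCR (amplicon) sequencing of SARS-CoV-2 patient samples yields large numbers of false-positive deletion calls. This paper characterizes those false positives and introduces a filtering strategy that removes >99% of them in Illumina short-read ARTIC data while still detecting a known deletion.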

## Contribution

A filtering strategy that reduces false-positive deletion calls in SARS-CoV-2 amplicon sequencing data, validated with positive control samples containing a known deletion.

## Key findings

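- The filtering strategy detected the known control deletion and removed >99% of false positives in Illumina short-read data from ARTIC amplicon sequencing protocols.
- In public COVID-19 swab data, deletions occurring independently of transcription regulatory sequences were ~20-fold less common than previously reported.
- After filtering, these deletions remain more frequent in symptomatic patients.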

## Abstract

Deletions are prevalent in the genomes of SARS-CoV-2 isolates from COVID-19 patients, but their roles in the severity, transmission, and persistence of disease are poorly understood. Millions of COVID-19 swab samples from patients have been sequenced and made available online, offering an unprecedented opportunity to study such deletions. Multiplex-PCR sequencing (amplicon-seq) has been the most widely used method for sequencing clinical COVID-19 samples. However, through experiments with negative control samples and existing bioinformatics methods, we find that this protocol introduces large numbers of false-positive deletions. These false positives commonly occur in short alignments, at low frequency and depth, and near primer binding sites used for whole-genome amplification. To address this issue, we developed a filtering strategy, validated with positive control samples containing a known deletion. Our strategy accurately detected the known deletion and removed >99% of false positives in Illumina short-read data from ARTIC amplicon sequencing protocols. This method, applied to public COVID-19 swab data, revealed that deletions occurring independently of transcription regulatory sequences were ~20-fold less common than previously reported; however, they remain more frequent in symptomatic patients. Our optimized approach should enhance the reliability of SARS-CoV-2 deletion characterization from surveillance studies. Finally, our approach may guide the development of bioinformatics pipelines for genome sequence analyses of other viruses.
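The abstract names four features of false-positive deletions: short alignments, low frequency, low depth, and proximity to primer binding sites. A minimal sketch of a filter built on those four criteria might look like the following. It is not the authors' implementation: the `DeletionCall` fields, the function names, and every threshold value are hypothetical placeholders chosen for illustration, and the paper's actual cutoffs are not given in this summary.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class DeletionCall:
    """One candidate deletion (hypothetical record layout)."""
    start: int             # genome coordinate of the first deleted base
    end: int               # genome coordinate of the last deleted base
    supporting_reads: int  # reads that support the deletion
    depth: int             # total reads spanning the site
    aligned_length: int    # representative aligned length of supporting reads


def near_primer(call: DeletionCall,
                primer_sites: Sequence[Tuple[int, int]],
                margin: int) -> bool:
    """True if either breakpoint lies within `margin` bases of a primer site."""
    for p_start, p_end in primer_sites:
        for breakpoint in (call.start, call.end):
            if p_start - margin <= breakpoint <= p_end + margin:
                return True
    return False


def passes_filters(call: DeletionCall,
                   primer_sites: Sequence[Tuple[int, int]],
                   min_aligned_length: int = 50,   # assumed value
                   min_frequency: float = 0.01,    # assumed value
                   min_depth: int = 100,           # assumed value
                   primer_margin: int = 5) -> bool:  # assumed value
    """Apply the four criteria; a call must clear all of them to be kept."""
    # Low depth (also guards the division below against depth == 0).
    if call.depth < max(min_depth, 1):
        return False
    # Low frequency: fraction of spanning reads that support the deletion.
    if call.supporting_reads / call.depth < min_frequency:
        return False
    # Short alignments.
    if call.aligned_length < min_aligned_length:
        return False
    # Breakpoint at or near a primer binding site.
    if near_primer(call, primer_sites, primer_margin):
        return False
    return True


def filter_deletions(calls: Iterable[DeletionCall],
                     primer_sites: Sequence[Tuple[int, int]]) -> List[DeletionCall]:
    """Return only the calls that pass every filter, in input order."""
    return [c for c in calls if passes_filters(c, primer_sites)]
```

In this sketch, `primer_sites` would be a list of `(start, end)` intervals taken from the primer scheme used for amplification, and a call is rejected as soon as it fails any one criterion. Whether the criteria should be combined this strictly, and with what cutoffs, is a design choice that the complete paper would need to settle.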

## Linked entities

- **Diseases:** COVID-19 (MONDO:0100096)

## Full-text entities

- **Diseases:** COVID-19 (MESH:D000086382)
- **Species:** Severe acute respiratory syndrome coronavirus 2 (no rank) [taxon 2697049], Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12900060/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12900060/full.md

## References

35 references — full list in the complete paper: https://tomesphere.com/paper/PMC12900060/full.md

---
Source: https://tomesphere.com/paper/PMC12900060