# RUMINA: high-throughput deduplication of unique molecular identifiers for amplicon and whole-genome sequencing with enhanced error correction

**Authors:** Eli Piliper, Stephanie Goya, Alexander L Greninger

PMC · DOI: 10.1093/bioinformatics/btag097 · 2026-02-24

## TL;DR

RUMINA is a Rust-based pipeline for UMI-aware deduplication and error correction of amplicon and shotgun sequencing data. In the authors' benchmarks it improved detection of ultra-low frequency SNVs (0.01%–1%) and ran up to 10-fold faster than existing tools.

## Contribution

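RUMINA introduces a high-performance UMI deduplication pipeline that integrates UMI-level and sequence-level error correction: multiple UMI clustering strategies, majority-rule read selection independent of mapping quality, discrete handling of 1–2 read clusters, paired-end merging, and read-length stratification.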

## Key findings

- RUMINA improves detection of ultra-low frequency SNVs (0.01%–1%) in benchmarks on simulated HIV population sequencing data and real-world iCLIP and TCR datasets.
- The tool reduces false positives and enhances reproducibility compared with existing tools.
- RUMINA processes sequencing data up to 10-fold faster than existing UMI deduplication tools.

## Abstract

Unique molecular identifiers (UMIs) are widely used in next-generation sequencing to enable accurate molecular counting and error correction. However, challenges remain in accurately collapsing UMI clusters, especially when read counts are low or sparse read clusters arise from barcode sequencing errors.
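
To make the collapsing problem concrete, the sketch below shows how a single barcode sequencing error produces a sparse one-read cluster, and how a count-aware merge can fold it back into its likely parent. This is a simplified greedy variant of the count-threshold idea used by directional network methods, not RUMINA's implementation; the names `hamming` and `collapse_umis`, the `max_dist` parameter, and the example barcodes are assumptions made for illustration.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("UMIs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(umis, max_dist=1):
    """Greedily merge low-count UMIs into abundant neighbours.

    A UMI is merged into an existing representative when it lies within
    `max_dist` mismatches AND the representative's original count is at
    least 2 * count - 1 (assumed threshold, borrowed from directional
    network methods). Returns {representative UMI: total read count}.
    """
    counts = Counter(umis)
    # Visit the most abundant UMIs first so they become representatives.
    ordered = sorted(counts, key=lambda u: (-counts[u], u))
    clusters = {}
    for umi in ordered:
        for rep in clusters:
            close = hamming(umi, rep) <= max_dist
            dominant = counts[rep] >= 2 * counts[umi] - 1
            if close and dominant:
                clusters[rep] += counts[umi]
                break
        else:
            # No suitable parent found: start a new cluster.
            clusters[umi] = counts[umi]
    return clusters


# One read of "ACGA" is most plausibly a sequencing error of "ACGT".
example = ["ACGT"] * 10 + ["ACGA"] + ["TTTT"] * 5
collapsed = collapse_umis(example)
```

In this toy input the singleton `ACGA` is absorbed into `ACGT`, while `TTTT` stays separate because it differs at three positions. Two equally abundant barcodes one mismatch apart are left unmerged, since neither dominates the other under the assumed count rule.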

We present RUMINA, a Rust-based pipeline for UMI-aware deduplication and error correction, optimized for both amplicon and shotgun sequencing. RUMINA supports multiple UMI clustering strategies, majority-rule read selection independent of mapping quality, discrete handling of 1–2 read clusters, paired-end merging, and read-length stratification. Benchmarking using simulated HIV population sequencing data and real-world iCLIP and TCR datasets showed that RUMINA improves ultra-low frequency SNV detection (0.01%–1%), reduces false positives, enhances reproducibility, and processes sequencing data up to 10-fold faster than existing tools. By integrating UMI- and sequence-level correction in a high-performance framework, RUMINA offers a fast, scalable, and robust solution for UMI-enabled sequencing workflows.
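
As a rough illustration of majority-rule read selection, the sketch below picks the most frequent read sequence within one UMI cluster and ignores mapping quality entirely. The abstract does not say how 1–2 read clusters are treated, so the policy here (keep a singleton, keep a pair only when both reads agree, or drop small clusters altogether via `keep_small`) is a hypothetical placeholder, as are the function and parameter names. It is written in Python for brevity and is not taken from the RUMINA source.

```python
from collections import Counter
from typing import Optional, Sequence


def select_representative(
    reads: Sequence[str], keep_small: bool = True
) -> Optional[str]:
    """Choose one read sequence to represent a UMI cluster.

    Clusters of 3+ reads: return the most common sequence (majority rule),
    breaking ties by first appearance in the input.
    Clusters of 1-2 reads: handled separately under an ASSUMED policy,
    because no meaningful majority exists.
    Returns None when the cluster is empty or discarded.
    """
    if not reads:
        return None

    if len(reads) <= 2:
        if not keep_small:
            return None  # hypothetical option: drop small clusters
        if len(reads) == 2 and reads[0] != reads[1]:
            return None  # two disagreeing reads: no majority to trust
        return reads[0]

    tally = Counter(reads)
    best = max(tally.values())
    # Scan in input order so ties resolve deterministically.
    for read in reads:
        if tally[read] == best:
            return read
    return None  # unreachable; keeps type checkers satisfied


cluster = ["ACGTAC", "ACGTAC", "ACGTAA"]
representative = select_representative(cluster)
```

Selecting by sequence frequency rather than by the highest mapping quality means a single well-mapped but erroneous read cannot outvote the rest of its cluster.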

RUMINA is implemented in Rust and distributed as open-source code and precompiled binaries. Source code and installation instructions are available at https://github.com/greninger-lab/rumina. Documentation associated with this manuscript is available at https://github.com/greninger-lab/rumina_paper.

## Full-text entities

- **Genes:** TRAJ60 (T cell receptor alpha joining 60 (pseudogene)) [NCBI Gene 28695] {aka TCRA}, TRBV20OR9-2 (T cell receptor beta variable 20/OR9-2 (non-functional)) [NCBI Gene 6962] {aka CDR3, TCRBV20S2, TCRBV2O, TCRBV2S2O}
- **Diseases:** cancer (MESH:D009369)
- **Species:** Human immunodeficiency virus 1 (no rank) [taxon 11676]

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12975283/full.md

---
Source: https://tomesphere.com/paper/PMC12975283