RUMINA: high-throughput deduplication of unique molecular identifiers for amplicon and whole-genome sequencing with enhanced error correction
Eli Piliper, Stephanie Goya, Alexander L Greninger

TL;DR
RUMINA is a fast, Rust-based tool that deduplicates sequencing reads tagged with unique molecular identifiers (UMIs), improving error correction and the detection of rare genetic variants.
Contribution
RUMINA introduces a high-performance UMI deduplication pipeline with enhanced error correction and improved detection of ultra-low frequency variants.
Findings
RUMINA improves detection of ultra-low frequency SNVs (0.01%–1%) in sequencing data.
The tool reduces false positives and increases reproducibility compared to existing methods.
RUMINA processes data up to 10-fold faster than other UMI deduplication tools.
Abstract
Unique molecular identifiers (UMIs) are widely used in next-generation sequencing to enable accurate molecular counting and error correction. However, challenges remain in accurately collapsing UMI clusters, especially when read counts are low or when sparse read clusters arise from barcode sequencing errors. We present RUMINA, a Rust-based pipeline for UMI-aware deduplication and error correction, optimized for both amplicon and shotgun sequencing. RUMINA supports multiple UMI clustering strategies, majority-rule read selection independent of mapping quality, discrete handling of 1–2 read clusters, paired-end merging, and read-length stratification. Benchmarking on simulated HIV population sequencing data and real-world iCLIP and TCR datasets showed that RUMINA improves ultra-low frequency SNV detection (0.01%–1%), reduces false positives, enhances reproducibility, and…
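To make the ideas of UMI clustering and majority-rule read selection concrete, the following is a minimal Python sketch and not RUMINA's implementation (RUMINA is written in Rust, and the abstract does not say which clustering strategies it offers). It assumes a directional clustering rule of the kind popularized by UMI-tools: a low-count UMI within one mismatch of a higher-count UMI is absorbed when the higher count is at least twice the lower count minus one. The function names, the `(position, umi, sequence)` read tuples, and the tie-breaking rule are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def cluster_umis(counts, max_dist=1):
    """Map every UMI to a representative UMI using directional clustering.

    Assumed rule (illustrative): `other` joins the cluster of `node` when the
    two UMIs differ by at most `max_dist` bases and
    counts[node] >= 2 * counts[other] - 1.
    """
    parent = {}
    # Visit UMIs from most to least abundant so likely true UMIs seed clusters.
    for umi in sorted(counts, key=lambda u: (-counts[u], u)):
        if umi in parent:
            continue
        parent[umi] = umi
        stack = [umi]
        while stack:
            node = stack.pop()
            for other in counts:
                if other in parent:
                    continue
                close = hamming(node, other) <= max_dist
                if close and counts[node] >= 2 * counts[other] - 1:
                    parent[other] = umi
                    stack.append(other)
    return parent


def deduplicate(reads, max_dist=1):
    """Collapse reads to one record per (position, UMI cluster).

    `reads` is an iterable of (position, umi, sequence) tuples. Returns a list
    of (position, representative_umi, consensus_sequence, cluster_size).
    """
    by_pos = defaultdict(list)
    for pos, umi, seq in reads:
        by_pos[pos].append((umi, seq))

    result = []
    for pos in sorted(by_pos):
        group = by_pos[pos]
        counts = Counter(umi for umi, _ in group)
        parent = cluster_umis(counts, max_dist)

        seqs_by_cluster = defaultdict(list)
        for umi, seq in group:
            seqs_by_cluster[parent[umi]].append(seq)

        for rep in sorted(seqs_by_cluster):
            seqs = seqs_by_cluster[rep]
            # Majority rule: keep the most frequent sequence in the cluster,
            # ignoring mapping quality; ties go to the first sequence seen.
            winner, _ = Counter(seqs).most_common(1)[0]
            result.append((pos, rep, winner, len(seqs)))
    return result


if __name__ == "__main__":
    example = [
        (100, "AAAA", "ACGT"),
        (100, "AAAA", "ACGT"),
        (100, "AAAA", "ACGA"),  # minority sequence, outvoted
        (100, "AAAT", "ACGT"),  # one mismatch from AAAA, absorbed
        (100, "CCCC", "ACGT"),  # distinct molecule at the same position
        (200, "AAAA", "TTTT"),  # same UMI, different position
    ]
    for record in deduplicate(example):
        print(record)
```

The cluster size is returned with each record so that clusters of only one or two reads, where a majority vote carries little information, can be routed to separate handling downstream. How RUMINA itself treats such clusters is not described in this summary.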
Taxonomy
Topics: Genomics and Phylogenetic Studies · Advanced Proteomics Techniques and Applications · Cancer Genomics and Diagnostics
