SOPanG 2: online searching over a pan-genome without false positives
Aleksander Cisłak, Szymon Grabowski

TL;DR
SOPanG 2 introduces an efficient method for accurate online searching over pan-genomes stored as elastic-degenerate strings, enabling precise identification of true positive matches mapped to individuals with minimal speed penalty.
Contribution
It extends the SOPanG tool to report only true positive matches in pan-genomes, improving accuracy without significantly reducing search speed.
Findings
Achieves single-thread throughput above 430 MB/s on real data
Adds less than 3.5% speed penalty for true positive verification
Successfully maps pattern matches onto individual genomes
Abstract
Motivation: The pan-genome can be stored as an elastic-degenerate (ED) string, a recently introduced compact representation of multiple overlapping sequences. However, a search over the ED string does not indicate which individuals (if any) match the entire query.
Results: We augment the ED string with sources (indexes of individuals) and propose an extension of the SOPanG (Shift-Or for Pan-Genome) tool that reports only true positive matches, omitting those that do not occur in any of the haplotypes. The additional stage for checking the matches slows the search by less than 3.5% in practice, which means that SOPanG 2 can report pattern matches in a pan-genome, mapping them onto individuals, at a single-thread throughput above 430 MB/s on real data.
Availability and implementation: SOPanG 2 is available at github.com/MrAlexSee/sopang
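To make the idea of sources concrete, here is a minimal Python sketch of a source-aware search over an ED string. It is a naive path enumeration written for illustration, not the Shift-Or algorithm used by SOPanG 2. The data layout (a list of segments, each a list of `(text, sources)` variants), the function name `search`, and the toy data are all assumptions made for this example. A match is kept only if at least one individual carries every variant along its path.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical layout: a variant is (text, ids of individuals carrying it);
# a segment is a list of alternative variants; an ED string is a list of segments.
Variant = Tuple[str, FrozenSet[int]]
Segment = List[Variant]
Match = Tuple[int, FrozenSet[int]]  # (index of segment where the match ends, individuals)


def search(ed: List[Segment], pattern: str) -> Set[Match]:
    """Naively find matches of a non-empty pattern carried by at least one individual."""
    results: Set[Match] = set()

    def extend(seg_idx: int, matched: int, sources: FrozenSet[int]) -> None:
        # matched: number of pattern characters consumed so far
        # sources: individuals consistent with every variant chosen so far
        if seg_idx == len(ed):
            return
        rest = pattern[matched:]
        for text, var_sources in ed[seg_idx]:
            common = sources & var_sources
            if not common:
                continue  # no individual follows this path: a false positive
            if len(text) >= len(rest):
                if text.startswith(rest):
                    results.add((seg_idx, common))
            elif rest.startswith(text):
                extend(seg_idx + 1, matched + len(text), common)

    for i, seg in enumerate(ed):
        for text, var_sources in seg:
            for off in range(len(text)):  # try every start offset in the variant
                suffix = text[off:]
                if len(suffix) >= len(pattern):
                    if suffix.startswith(pattern):
                        results.add((i, var_sources))
                elif pattern.startswith(suffix):
                    extend(i + 1, len(suffix), var_sources)
    return results


# Toy pan-genome of three individuals (0, 1, 2); purely illustrative data.
ALL = frozenset({0, 1, 2})
ed = [
    [("AC", ALL)],
    [("G", frozenset({0, 1})), ("T", frozenset({2}))],
    [("A", ALL)],
    [("C", frozenset({0})), ("T", frozenset({1, 2}))],
    [("G", ALL)],
]

print(search(ed, "GAT"))  # {(3, frozenset({1}))}: only individual 1 carries G..A..T
print(search(ed, "TAC"))  # set(): T comes from individual 2, C from individual 0
```

In this toy data, a source-unaware search would report "TAC" because each of its characters exists in some variant, yet no single individual contains it. Intersecting the source sets along the path is what filters out such false positives.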
Taxonomy
Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · Algorithms and Data Compression
