TL;DR
This paper introduces PHONI, a streaming algorithm for computing matching statistics against multi-genome references, enabling efficient, online analysis of large genomic patterns with low latency and manageable memory use.
Contribution
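It simplifies the two-pass solution of Bannai et al., as implemented by Rossi et al., into a single-pass streaming algorithm for computing matching statistics against a highly compressed genomic database. The streaming version is slightly slower, but it supports parallel and online processing of long patterns.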
Findings
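Computes the matching statistics of several long patterns, such as whole human chromosomes, in parallel
Computes matching statistics online with low latency, so it can quickly recognize when a pattern becomes incompressible relative to the database
Uses a reasonable amount of RAM even for long patterns, at the cost of a slight slowdown compared with the two-pass approach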
Abstract
Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database.
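To make the term concrete: under the usual definition, the matching statistics of a pattern with respect to a text give, for each pattern position i, the length of the longest prefix of the pattern's suffix starting at i that occurs somewhere in the text, together with one position where it occurs. The sketch below is a brute-force illustration of that definition only. It is not PHONI or the algorithm of Bannai et al., it does not work on a compressed index, and the function name and the (position, length) output format are assumptions made for this example.

```python
def naive_matching_statistics(text, pattern):
    """Brute-force matching statistics (illustrative only, not PHONI).

    Returns one (pos, length) pair per pattern position i, where length
    is the longest prefix of pattern[i:] occurring in text and pos is
    the leftmost text position where a match of that length starts
    (-1 if length is 0).
    """
    result = []
    for i in range(len(pattern)):
        best_pos, best_len = -1, 0
        for j in range(len(text)):
            # Extend the match between pattern[i:] and text[j:] as far as possible.
            k = 0
            while (i + k < len(pattern) and j + k < len(text)
                   and pattern[i + k] == text[j + k]):
                k += 1
            # Strict comparison keeps the leftmost position among ties.
            if k > best_len:
                best_pos, best_len = j, k
        result.append((best_pos, best_len))
    return result


if __name__ == "__main__":
    for i, (pos, length) in enumerate(naive_matching_statistics("GATTACA", "TACAG")):
        print(i, pos, length)
```

This version takes time roughly cubic in the input size in the worst case, which is why index-based methods are needed for genomic databases. It also hints at why online output is useful: a run of short match lengths signals that the pattern no longer resembles anything in the reference.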
