Sketching and Sequence Alignment: A Rate-Distortion Perspective
Ilan Shomorony, Govinda M. Kamath

TL;DR
This paper introduces a rate-distortion framework for DNA read sketching, proposing a new locational hashing algorithm that reduces sketch size requirements and enhances computational efficiency in pairwise alignment tasks.
Contribution
The paper develops a rate-distortion framework for DNA read sketching and a new locational hashing algorithm whose sketches are smaller than those of standard min-hash-based methods at the same distortion.
Findings
Locational hashing requires fewer bits than min-hash for the same distortion.
The sketch size of locational hashing grows only as the squared logarithm of the inverse distortion, log²(1/D).
Significant computational savings are possible in DNA sequence alignment.
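To get a feel for the gap between the two scalings, the short sketch below tabulates them for a few distortion levels. It assumes the min-hash sketch size scales as (1/D) · log n and the locational hashing sketch size as log²(1/D), with every hidden constant set to 1; those constants, the base-2 logarithm, the read length, and the function names are illustrative assumptions, not values from the paper.

```python
import math


def minhash_bits(n: int, distortion: float) -> float:
    """Assumed min-hash scaling: (1/D) * log2(n), constant factor set to 1."""
    return math.log2(n) / distortion


def locational_bits(distortion: float) -> float:
    """Assumed locational hashing scaling: log2(1/D)^2, constant factor set to 1."""
    return math.log2(1.0 / distortion) ** 2


if __name__ == "__main__":
    n = 10_000  # hypothetical read length
    print(f"{'D':>8} {'min-hash':>12} {'locational':>12}")
    for d in (0.1, 0.01, 0.001):
        print(f"{d:>8} {minhash_bits(n, d):>12.1f} {locational_bits(d):>12.1f}")
```

Because the constants are unknown here, only the growth trend is meaningful: the first column grows linearly in 1/D, the second only polylogarithmically.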
Abstract
Pairwise alignment of DNA sequencing data is a ubiquitous task in bioinformatics and typically represents a heavy computational burden. A standard approach to speed up this task is to compute "sketches" of the DNA reads (typically via hashing-based techniques) that allow the efficient computation of pairwise alignment scores. We propose a rate-distortion framework to study the problem of computing sketches that achieve the optimal tradeoff between sketch size and alignment estimation distortion. We consider the simple setting of i.i.d. error-free sources of length n and introduce a new sketching algorithm called "locational hashing." While standard approaches in the literature based on min-hashes require B = (1/D) · O(log n) bits to achieve a distortion D, our proposed approach only requires B = log²(1/D) · O(1) bits. This can lead to significant…
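For readers unfamiliar with the baseline, here is a minimal sketch of a standard min-hash sketch used to estimate k-mer overlap between two reads. It illustrates the min-hash approach the abstract compares against, not the paper's locational hashing algorithm. The k-mer length, the number of hash functions, the use of seeded BLAKE2b as the hash family, and all function names are assumptions made for illustration.

```python
import hashlib
import random


def kmers(read: str, k: int) -> set:
    """Return the set of all length-k substrings of the read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def seeded_hash(kmer: str, seed: int) -> int:
    """64-bit hash of a k-mer; a different seed gives a different hash function."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(read: str, k: int = 8, num_hashes: int = 128) -> list:
    """Sketch = minimum hash value over the read's k-mers, one per hash function."""
    ks = kmers(read, k)
    return [min(seeded_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a: list, sketch_b: list) -> float:
    """Fraction of matching sketch entries estimates the k-mer Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    rng = random.Random(0)
    genome = "".join(rng.choice("ACGT") for _ in range(300))
    read1, read2 = genome[:200], genome[100:]  # error-free reads overlapping by 100 bases
    print(estimate_jaccard(minhash_sketch(read1), minhash_sketch(read2)))
```

Each sketch entry matches with probability equal to the Jaccard similarity of the two k-mer sets, so the estimate's accuracy improves only as more hash values are stored, which is why the sketch size of this approach grows with 1/D.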
