Distortion-Resistant Hashing for rapid search of similar DNA subsequence

Jarek Duda

arXiv:1602.05889·cs.DS·February 19, 2016

TL;DR

This paper introduces Distortion-Resistant Hashing (DRH), a novel method for rapidly identifying similar DNA subsequences by generating fingerprints tolerant to small sequence variations, based on rate distortion theory.

Contribution

It presents a new hashing approach that tolerates small distortions, enabling efficient search for similar DNA sequences in bioinformatics.

Findings

01 DRH effectively identifies similar sequences despite small variations.

02 The method leverages rate distortion theory for hash construction.

03 Potential for faster, more tolerant sequence matching in genomics.

Abstract

One of the basic tasks in bioinformatics is localizing a short subsequence $S$, read while sequencing, in a long reference sequence $R$, like the human genome. A natural rapid approach would be to compute a hash value for $S$ and compare it with a prepared database of hash values for each length-$|S|$ subsequence of $R$. The problem with such an approach is that it would only spot a perfect match, while in reality there are lots of small changes: substitutions, deletions and insertions. This issue could be repaired with a hash function designed to tolerate some small distortion according to an alignment metric (like Needleman-Wunsch), so that two similar sequences most likely give the same hash value. This paper discusses the construction of Distortion-Resistant Hashing (DRH) to generate such fingerprints for rapid search of similar subsequences. The proposed…
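To make the motivation concrete, here is a minimal Python sketch of the exact-match baseline the abstract describes, not of DRH itself. It indexes every window of a reference with the same length as the read in an ordinary hash table and shows that a single substitution in the read makes the lookup fail. The function names `build_index` and `lookup`, and the toy sequences, are assumptions made for illustration only.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict:
    """Map every length-k window of the reference to its start positions.

    The dict hashes each window internally, so this plays the role of the
    'prepared database of hash values' in the exact-match baseline.
    """
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return index


def lookup(index: dict, read: str) -> list:
    """Return start positions where the read matches exactly, else []."""
    return index.get(read, [])


# Toy data (hypothetical, chosen only to illustrate the failure mode).
reference = "ACGTACGTTAGCCGATAGGCTA"
read_exact = reference[5:13]                            # copied verbatim
read_mutated = read_exact[:3] + "A" + read_exact[4:]    # one substitution

index = build_index(reference, len(read_exact))
exact_hits = lookup(index, read_exact)      # finds the original position
mutated_hits = lookup(index, read_mutated)  # a single change breaks the match

print(exact_hits, mutated_hits)
```

An ordinary hash changes completely under a one-letter edit, which is why the mutated read finds nothing; a distortion-resistant hash would instead aim to send both reads to the same value with high probability.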

Taxonomy

Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · DNA and Biological Computing