# conLSH: Context based Locality Sensitive Hashing for Mapping of noisy SMRT Reads

**Authors:** Angana Chakraborty, Sanghamitra Bandyopadhyay

arXiv: 1903.04925 · 2019-03-13

## TL;DR

This paper introduces conLSH, a context-based Locality Sensitive Hashing algorithm for aligning noisy SMRT sequencing reads to a reference genome. The authors report that the conLSH-based aligner beats rHAT in both speed and peak memory usage.

## Contribution

The paper presents a new contextual LSH algorithm for noisy read alignment that hashes sequences by both closeness and similarity of context, improving speed and memory efficiency over rHAT, a popular SMRT read aligner.

## Key findings

- The conLSH-based aligner takes approximately 24.2% less processing time than rHAT on the H. sapiens PacBio dataset.
- It saves about 70.3% in peak memory relative to rHAT on the same dataset.
- The algorithm effectively aligns noisy SMRT reads to reference genomes.

## Abstract

Single Molecule Real-Time (SMRT) sequencing is a recent advancement of Next Gen technology developed by Pacific Biosciences (PacBio). It comes with an explosion of long and noisy reads, demanding cutting-edge research to get the most out of it. To deal with the high error probability of SMRT data, a novel contextual Locality Sensitive Hashing (conLSH) based algorithm is proposed in this article, which can effectively align the noisy SMRT reads to the reference genome. Here, sequences are hashed together based not only on their closeness, but also on similarity of context. The algorithm has $\mathcal{O}(n^{\rho+1})$ space requirement, where $n$ is the number of sequences in the corpus and $\rho$ is a constant. The indexing time and querying time are bounded by $\mathcal{O}\left(\frac{n^{\rho+1} \cdot \ln n}{\ln \frac{1}{P_2}}\right)$ and $\mathcal{O}(n^\rho)$ respectively, where $P_2 > 0$ is a probability value. This algorithm is particularly useful for retrieving similar sequences, a widely used task in biology. The proposed conLSH-based aligner is compared with rHAT, popularly used for aligning SMRT reads, and is found to comprehensively beat it in speed as well as in memory requirements. In particular, it takes approximately $24.2\%$ less processing time, while saving about $70.3\%$ in peak memory requirement for the H. sapiens PacBio dataset.
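To make the idea of hashing by closeness plus context more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the hash family, so the sketch assumes a simple scheme in which each hash function samples a few anchor positions in a fixed-length window and also reads a fixed number of bases to the right of each anchor as its context. All names (`lsh_parameters`, `make_context_hash`, `build_index`, `query`) and the parameters `window_len`, `k`, `context`, `p1` are hypothetical. The parameter helper uses the textbook LSH choices $K = \ln n / \ln(1/P_2)$ and $L = n^{\rho}$ with $\rho = \ln(1/P_1) / \ln(1/P_2)$, which is an assumption here; under it, indexing $n$ sequences into $L$ tables with $K$ positions each costs $n \cdot L \cdot K$, matching the form of the indexing bound quoted above.

```python
import math
import random
from collections import defaultdict


def lsh_parameters(n, p1, p2):
    """Textbook LSH parameter choices (assumed, not taken from the paper)."""
    rho = math.log(1 / p1) / math.log(1 / p2)
    k = math.ceil(math.log(n) / math.log(1 / p2))  # positions per hash function
    l = math.ceil(n ** rho)                        # number of hash tables
    return rho, k, l


def make_context_hash(window_len, k, context, rng):
    """Build one hash function over windows of length window_len.

    It samples k anchor positions; each anchor contributes its own base
    plus `context` bases to its right (a hypothetical notion of context).
    """
    anchors = sorted(rng.sample(range(window_len - context), k))

    def h(window):
        return "".join(window[a:a + context + 1] for a in anchors)

    return h


def build_index(reference, window_len, hashes):
    """Hash every window of the reference into one table per hash function."""
    tables = [defaultdict(list) for _ in hashes]
    for start in range(len(reference) - window_len + 1):
        window = reference[start:start + window_len]
        for table, h in zip(tables, hashes):
            table[h(window)].append(start)
    return tables


def query(read_window, tables, hashes):
    """Return reference positions that collide with the read in any table."""
    hits = set()
    for table, h in zip(tables, hashes):
        hits.update(table.get(h(read_window), []))
    return sorted(hits)


if __name__ == "__main__":
    rng = random.Random(0)
    reference = "".join(rng.choice("ACGT") for _ in range(200))
    hashes = [make_context_hash(20, 4, 2, rng) for _ in range(3)]
    tables = build_index(reference, 20, hashes)
    # An error-free read window always collides with its own origin.
    print(50 in query(reference[50:70], tables, hashes))
```

Using several tables is what gives a noisy read a chance to be retrieved: a sequencing error only spoils the hash functions whose anchors or context bases overlap it, so the read can still collide with its true origin in another table. Candidate positions returned by `query` would then need a proper alignment step, which this sketch leaves out.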

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1903.04925/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/1903.04925/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/1903.04925/full.md

---
Source: https://tomesphere.com/paper/1903.04925