Smooth $q$-Gram, and Its Applications to Detection of Overlaps among Long, Error-Prone Sequencing Reads
Haoyu Zhang, Qin Zhang, Haixu Tang

TL;DR
This paper introduces smooth q-gram, a variant of q-gram that captures pairs of q-grams within a small edit distance, and uses it to detect overlaps among error-prone long sequencing reads more accurately than existing q-gram based algorithms.
Contribution
The paper presents smooth q-gram, the first q-gram variant that captures pairs of q-grams within a small edit distance, and applies it to detecting overlaps among error-prone long reads from SMRT sequencing.
Findings
Significant accuracy improvement over existing q-gram methods
Effective in detecting overlaps in error-prone long reads
Validated on real-world sequencing benchmarks
Abstract
We propose smooth $q$-gram, the first variant of $q$-gram that captures $q$-gram pairs within a small edit distance. We apply smooth $q$-gram to the problem of detecting overlapping pairs of error-prone reads produced by single molecule real time sequencing (SMRT), which is the first and most critical step of the de novo fragment assembly of SMRT reads. We have implemented and tested our algorithm on a set of real-world benchmarks. Our empirical results demonstrate that our algorithm is significantly more accurate than the existing $q$-gram based algorithms.
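To make the motivation concrete, the sketch below contrasts exact q-gram matching with matching q-grams up to a small edit distance. It is a brute-force illustration only, not the paper's smooth q-gram construction, which the abstract does not describe. The function names (`qgrams`, `edit_distance`, `shared_exact`, `near_pairs`) and the toy reads are hypothetical and chosen for this example.

```python
def qgrams(s, q):
    """Return all contiguous substrings of length q (classical q-grams)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def shared_exact(x, y, q):
    """q-grams that occur verbatim in both strings."""
    return set(qgrams(x, q)) & set(qgrams(y, q))


def near_pairs(x, y, q, k):
    """Brute force: all q-gram pairs (one from each string) within edit distance k.

    Assumption: this quadratic scan only illustrates the matching criterion;
    it is not how the paper computes smooth q-grams.
    """
    return [(g, h)
            for g in set(qgrams(x, q))
            for h in set(qgrams(y, q))
            if edit_distance(g, h) <= k]


# Toy reads (hypothetical): read_b is read_a with a single substitution.
read_a = "ACGTACGT"
read_b = "ACGAACGT"
exact = shared_exact(read_a, read_b, q=6)
near = near_pairs(read_a, read_b, q=6, k=1)
```

With these toy reads, the single substitution falls inside every 6-gram, so exact matching finds nothing in common, while allowing one edit still pairs up corresponding 6-grams. This is the kind of signal that exact q-gram methods lose on error-prone reads.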
Taxonomy
Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Natural Language Processing Techniques
