TL;DR
This paper introduces a generalized locality-sensitive hashing method for sequence bucketing that improves sensitivity in high-error bioinformatics data, providing theoretical analysis and optimal constructions.
Contribution
It develops a new class of locality-sensitive bucketing (LSB) functions that allow a sequence to be mapped into multiple buckets, with theoretical bounds and optimality proofs.
Findings
Constructed LSB functions for various sensitivity parameters.
Analyzed efficiency in terms of bucket count and sequence mappings.
Proved lower bounds and optimality of some LSB functions.
Abstract
Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, a bucketing method should assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing $k$-mer-based bucketing methods have been efficient in processing sequencing data with low error rate, but encounter much reduced sensitivity on data with high error rate. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue by tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. Here we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be $(d_1,…
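To make the sensitivity problem concrete, here is a minimal sketch of plain $k$-mer bucketing, the baseline the abstract describes, and not the paper's LSB construction. The function names `kmer_buckets` and `share_bucket` and the example sequences are hypothetical, chosen only for illustration. Each sequence is assigned to one bucket per distinct $k$-mer it contains, and two sequences count as "found similar" if they share at least one bucket.

```python
def kmer_buckets(seq, k):
    """Assign a sequence to one bucket per distinct k-mer (hypothetical helper)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def share_bucket(a, b, k):
    """Return True if the two sequences land in at least one common bucket."""
    return bool(kmer_buckets(a, k) & kmer_buckets(b, k))


# Two length-10 sequences that differ by a single substitution at index 4.
s = "ACGTACGGTC"
t = "ACGTTCGGTC"

# With short k-mers, some windows avoid the edited position, so the pair collides.
short_k = share_bucket(s, t, 4)

# With k = 6, every window of a length-10 sequence covers index 4,
# so one edit is enough to separate the pair into disjoint bucket sets.
long_k = share_bucket(s, t, 6)

print(short_k, long_k)
```

Under these assumptions, a larger $k$ gives more selective buckets but lets a single edit break every shared $k$-mer, which is the loss of sensitivity on high-error data that motivates tolerating edits in the bucketing function itself.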
