Locality-sensitive bucketing functions for the edit distance

Ke Chen; Mingfu Shao

arXiv:2206.03097 · cs.DS · June 27, 2022

TL;DR

This paper generalizes locality-sensitive hashing to locality-sensitive bucketing (LSB), in which a sequence may be mapped to multiple buckets, with the aim of improving sensitivity on bioinformatics data with high error rates. It provides a theoretical analysis and constructions, some of which are proved optimal.

Contribution

The paper develops a new class of locality-sensitive bucketing (LSB) functions that allow a sequence to be mapped into multiple buckets, together with theoretical bounds and optimality proofs.

Findings

1. Constructed LSB functions for various sensitivity parameters.

2. Analyzed efficiency in terms of the number of buckets and the number of buckets each sequence is mapped to.

3. Proved lower bounds and the optimality of some LSB functions.

Abstract

Many bioinformatics applications involve bucketing a set of sequences, where each sequence may be assigned to multiple buckets. To achieve both high sensitivity and high precision, a bucketing method should assign similar sequences to the same bucket and dissimilar sequences to distinct buckets. Existing $k$-mer-based bucketing methods are efficient on sequencing data with low error rates, but their sensitivity drops sharply on data with high error rates. Locality-sensitive hashing (LSH) schemes can mitigate this issue by tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. Here we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be $(d_1,…
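To make the sensitivity problem with $k$-mer-based bucketing concrete, here is a minimal Python sketch. It is not the paper's LSB construction and not code from the Shao-Group/lsbucketing repository. The function names (`kmer_buckets`, `edit_distance`, `share_bucket`) and the example sequences are hypothetical, chosen only for illustration. Each sequence is assigned to one bucket per distinct $k$-mer it contains, and two sequences are treated as candidates for being similar if they share at least one bucket.

```python
from typing import Set


def kmer_buckets(seq: str, k: int) -> Set[str]:
    """Assign seq to one bucket per distinct k-mer (illustrative baseline only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # match or substitute
        prev = curr
    return prev[-1]


def share_bucket(a: str, b: str, k: int) -> bool:
    """True if the two sequences land in at least one common bucket."""
    return bool(kmer_buckets(a, k) & kmer_buckets(b, k))


# Hypothetical example: two length-8 sequences that differ by one substitution.
a = "ACGTACGT"
b = "ACGTTCGT"
print(edit_distance(a, b))    # the sequences are one edit apart
print(share_bucket(a, b, 3))  # short k-mers still collide
print(share_bucket(a, b, 5))  # every 5-mer window covers the edited position
```

In this toy case the single substitution falls inside every length-5 window, so the two sequences share no 5-mer bucket even though their edit distance is 1. Lowering $k$ restores the shared bucket here, but in general shorter $k$-mers also make dissimilar sequences more likely to collide, which is the sensitivity and precision trade-off the abstract describes.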

Code & Models

Repositories

Shao-Group/lsbucketing
Official
