Clustering Protein Sequences Given the Approximation Stability of the Min-Sum Objective Function
Konstantin Voevodski, Maria-Florina Balcan, Heiko Roglin, Shang-Hua Teng, Yu Xia

TL;DR
This paper introduces an efficient clustering algorithm for protein sequences that operates with limited distance information, leveraging approximation stability of the min-sum objective, and demonstrates its effectiveness against standard methods.
Contribution
The paper presents a novel clustering algorithm that uses few one-versus-all distance queries based on approximation stability assumptions, improving efficiency in protein sequence clustering.
Findings
The algorithm finds an accurate clustering using few one-versus-all distance queries.
The method outperforms established algorithms in empirical tests.
The approach is effective for protein data when distance information is limited.
Abstract
We study the problem of efficiently clustering protein sequences in a limited information setting. We assume that we do not know the distances between the sequences in advance, and must query them during the execution of the algorithm. Our goal is to find an accurate clustering using few queries. We model the problem as a point set S with an unknown metric d on S, and assume that we have access to one-versus-all distance queries that, given a point s in S, return the distances between s and all other points. Our one-versus-all query represents an efficient sequence database search program such as BLAST, which compares an input sequence to an entire data set. Given a natural assumption about the approximation stability of the min-sum objective function for clustering, we design a provably accurate clustering algorithm that uses few one-versus-all queries. In our…
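To make the query model concrete, here is a minimal Python sketch of a one-versus-all oracle and of the min-sum cost. It is not the paper's algorithm: the clustering step uses farthest-first landmark selection as a simple stand-in that issues exactly k queries. All names here (OneVersusAllOracle, landmark_clustering, min_sum_cost) are hypothetical, Euclidean distance on toy points stands in for a sequence similarity search, and the cost counts each unordered pair once (some definitions count ordered pairs, which doubles the value).

```python
import math
import random


class OneVersusAllOracle:
    """Hypothetical stand-in for a database search such as BLAST.

    The metric is hidden from the caller; the only way to learn
    distances is query(i), which returns d(i, j) for every j.
    """

    def __init__(self, points):
        self._points = points
        self.num_queries = 0  # number of one-versus-all queries issued

    def query(self, i):
        self.num_queries += 1
        p = self._points[i]
        return [math.dist(p, q) for q in self._points]


def landmark_clustering(oracle, n, k, seed=0):
    """Farthest-first landmark selection (an assumption, not the paper's method).

    Issues exactly k one-versus-all queries, then assigns every point
    to its nearest landmark.
    """
    rng = random.Random(seed)
    first = rng.randrange(n)
    landmarks = [first]
    rows = [oracle.query(first)]
    min_dist = list(rows[0])  # distance from each point to its nearest landmark
    while len(landmarks) < k:
        nxt = max(range(n), key=lambda j: min_dist[j])
        landmarks.append(nxt)
        row = oracle.query(nxt)
        rows.append(row)
        min_dist = [min(a, b) for a, b in zip(min_dist, row)]
    labels = [min(range(k), key=lambda c: rows[c][j]) for j in range(n)]
    return labels, landmarks


def min_sum_cost(points, labels):
    """Min-sum cost: total pairwise distance within clusters.

    Needs all pairwise distances, so it is for evaluation only.
    Each unordered pair is counted once.
    """
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if labels[i] == labels[j]:
                total += math.dist(points[i], points[j])
    return total
```

Calling landmark_clustering(OneVersusAllOracle(points), len(points), k) returns a label per point and the chosen landmark indices; the oracle's num_queries attribute then shows how many one-versus-all queries were spent, which is the resource the abstract aims to keep small.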
Taxonomy
Topics: Algorithms and Data Compression · Artificial Immune Systems Applications · Bayesian Methods and Mixture Models
