Efficient Clustering with Limited Distance Information
Konstantin Voevodski, Maria-Florina Balcan, Heiko Roglin, Shang-Hua Teng, Yu Xia

TL;DR
This paper introduces an efficient clustering method that needs only a small number of one-versus-all distance queries, and demonstrates its effectiveness on protein sequence clustering.
Contribution
The paper presents a novel algorithm for clustering with limited distance queries, achieving accurate results under certain structural assumptions, and applies it to protein sequence data.
Findings
Achieves accurate clustering with only O(k) one-versus-all distance queries, given a natural assumption about the structure of the instance
Effective in protein sequence clustering using sequence database searches
Produces clusterings close to manual classifications despite querying only a small fraction of the pairwise distances
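To make "a small fraction of the distances" concrete, here is a rough count. It assumes (this is our illustration, not a figure from the paper) that the algorithm issues exactly q one-versus-all queries on a set of n points; the symbols q and n and the numbers in the example are hypothetical.

```latex
% Each one-versus-all query reveals at most n - 1 distances,
% out of \binom{n}{2} = n(n-1)/2 pairwise distances in total.
\[
  \frac{\text{distances queried}}{\text{all pairwise distances}}
  \;\le\; \frac{q\,(n-1)}{n(n-1)/2}
  \;=\; \frac{2q}{n}.
\]
% Hypothetical example: n = 10\,000 points and q = 50 queries give
% at most 2 \cdot 50 / 10\,000 = 1\% of all pairwise distances.
```

With q = O(k) and k much smaller than n, this fraction shrinks as the dataset grows.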
Abstract
Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume that we have access to one-versus-all queries that, given a point s ∈ S, return the distances between s and all other points. We show that, given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. We use our algorithm to cluster proteins by sequence similarity. This setting nicely fits our model because we can use a fast sequence database search program to query a sequence against an entire dataset. We conduct an empirical study that shows that even though we query a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification.
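The query model in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm, which the abstract does not spell out. It swaps in a plain farthest-first landmark selection followed by nearest-landmark assignment, chosen only because it uses exactly k one-versus-all queries. The names `OneVsAllOracle` and `cluster_with_k_queries` are hypothetical, and the oracle uses Euclidean distance on toy points as a stand-in for a sequence database search.

```python
import math
import random


class OneVsAllOracle:
    """Hypothetical oracle: one call returns distances from one point to all points.

    Euclidean distance on toy coordinates stands in for the unknown metric d.
    """

    def __init__(self, points):
        self._points = points
        self.queries = 0  # number of one-versus-all queries issued so far

    def query(self, i):
        self.queries += 1
        p = self._points[i]
        return [math.dist(p, q) for q in self._points]


def cluster_with_k_queries(oracle, n, k, seed=0):
    """Farthest-first landmark selection using exactly k one-versus-all queries.

    Illustrative substitute, not the authors' method. Returns the landmark
    indices and, for every point, the index of its nearest landmark.
    """
    rng = random.Random(seed)
    next_landmark = rng.randrange(n)  # arbitrary starting point
    landmarks = []
    min_dist = [math.inf] * n  # distance from each point to its nearest landmark
    labels = [0] * n
    for cluster_id in range(k):
        landmarks.append(next_landmark)
        row = oracle.query(next_landmark)  # the only access to distances
        for j in range(n):
            if row[j] < min_dist[j]:
                min_dist[j] = row[j]
                labels[j] = cluster_id
        # The next landmark is the point farthest from all landmarks chosen so far.
        next_landmark = max(range(n), key=min_dist.__getitem__)
    return landmarks, labels
```

For example, calling `cluster_with_k_queries(OneVsAllOracle(points), len(points), k)` on a list of coordinate tuples issues k queries in total, which the oracle's `queries` counter confirms. Every distance the procedure uses comes from one of those k rows, so it never needs the full pairwise distance matrix.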
Taxonomy
Topics: Advanced Clustering Algorithms Research · Data Management and Algorithms · Algorithms and Data Compression
