Metric $k$-clustering using only Weak Comparison Oracles
Rahul Raychaudhury, Aryan Esmailpour, Sainyam Galhotra, Stavros Sintos

TL;DR
This paper introduces randomized algorithms for $k$-clustering that operate solely on weak comparison oracles, replacing exact distance queries with noisy relative comparisons, and achieves near-optimal clustering with low query complexity.
Contribution
It develops the first scalable clustering algorithms that use only noisy quadruplet comparison oracles, extending applicability to scenarios with limited or noisy distance information.
Findings
Achieves a constant-factor approximation with $O(nk \cdot \text{polylog}(n))$ queries.
Improves to a $(1+\varepsilon)$-approximation, for small $\varepsilon$, in metrics of bounded doubling dimension.
Demonstrates integration of noisy, low-cost oracles like language models into clustering algorithms.
Abstract
Clustering is a fundamental primitive in unsupervised learning. However, classical algorithms for $k$-clustering (such as $k$-median and $k$-means) assume access to exact pairwise distances -- an unrealistic requirement in many modern applications. We study clustering in the \emph{Rank-model (R-model)}, where access to distances is entirely replaced by a \emph{quadruplet oracle} that provides only relative distance comparisons. In practice, such an oracle can represent learned models or human feedback, and is expected to be noisy and entail an access cost. Given a metric space with $n$ input items, we design randomized algorithms that, using only a noisy quadruplet oracle, compute a set of centers along with a mapping from the input items to the centers such that the clustering cost of the mapping is at most a constant times the optimum $k$-clustering…
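To make the oracle model concrete, here is a minimal simulation sketch of a noisy quadruplet oracle. It is not the paper's construction: the class name `NoisyQuadrupletOracle`, the method `closer`, the Euclidean ground-truth metric, the independent flip probability `noise`, and the choice to keep answers persistent across repeated queries are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(p: Point, q: Point) -> float:
    """Ground-truth metric; hidden from any algorithm that uses the oracle."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


class NoisyQuadrupletOracle:
    """Hypothetical simulator: answers 'is dist(a, b) <= dist(c, d)?' with noise.

    Assumptions of this sketch (not taken from the paper):
      * each distinct query is flipped independently with probability `noise`;
      * answers are persistent, so repeating a query cannot average out the noise;
      * every call counts toward the access cost, tracked in `queries`.
    """

    def __init__(
        self,
        points: Sequence[Point],
        dist: Callable[[Point, Point], float] = euclidean,
        noise: float = 0.1,
        seed: int = 0,
    ) -> None:
        if not 0.0 <= noise < 0.5:
            raise ValueError("noise must lie in [0, 0.5)")
        self._points = points
        self._dist = dist
        self._noise = noise
        self._rng = random.Random(seed)
        self._answers: Dict[Tuple[int, int, int, int], bool] = {}
        self.queries = 0

    def closer(self, a: int, b: int, c: int, d: int) -> bool:
        """Return the (possibly wrong) answer to 'dist(a, b) <= dist(c, d)?'."""
        self.queries += 1
        key = (a, b, c, d)
        if key not in self._answers:
            truth = self._dist(self._points[a], self._points[b]) <= self._dist(
                self._points[c], self._points[d]
            )
            flip = self._rng.random() < self._noise
            self._answers[key] = (not truth) if flip else truth
        return self._answers[key]
```

An algorithm in this model would only ever call `closer` with item indices and never read coordinates or distances directly; the `queries` counter gives a simple way to compare query complexity across approaches.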
Peer Reviews
Decision·ICLR 2026 Poster
The problem setting is very natural: in cases where distance computation might be hard or expensive, performing comparisons using machine learning models might be efficient. The proofs are clear to the best of my knowledge. This work removes the requirement of a distance oracle and provides a coreset construction using only oracle queries. The results obtained are near-optimal.
The experimental setting is very limited, with experiments run on only one synthetic dataset. But the algorithmic contributions and results obtained are non-trivial and significant, so this is not really a major concern for me. Maybe a minor typo: probsort (in Lemma 2.1) requires noise parameter $\leq 1/4$ but Appendix A says $\leq 1/2$.
Overall, I like this paper. It follows the recent line of work for weak-strong oracle models. However, different from existing results that almost exclusively showed the necessity of the strong oracle, this paper is the first to show that we could do something without the strong oracle at all (as far as I know). I think this is a nice conceptual contribution. The paper is well-written, and although I did not get the time to check all the steps, the technical overview is easy to follow. Therefore
I do not see any major weakness in the paper. One thing I want the author to emphasize is that one can know the coreset without knowing the clustering, since I was confused for a moment about whether there is a contradiction with GRS [PODS’24]. Some of the technical overview is a bit wordy and dense with math. I understand it’s a bit hard to present it in a cleaner manner. Maybe you can add some figures for the guard and kernel sets, plus the filtering process. I believe that’ll help readers u
The paper is well-written and easy to follow. The weak-oracle model is well motivated, and oracle-based models in general have gained significant attention in the last few years.
While it is clear why ALG-DI requires bounded doubling dimension, it is not entirely clear why ALG-D cannot be used in the general setting. Minor technical comments (see Detailed technical overview).
Taxonomy
Topics: Advanced Clustering Algorithms Research · Stochastic Gradient Optimization Techniques · Facility Location and Emergency Management
